Abstract



We consider in this paper top-k query answering in social tagging (or bookmarking)
applications. This problem requires a significant departure from existing,
socially agnostic techniques. In a network-aware context, one can (and should)
exploit the social links, which can indicate how users relate to the seeker and how
much weight their tagging actions should have in the result build-up. We propose
an algorithm that has the potential to scale to current applications. While the
problem has already been considered in previous literature, this was done either
under strong simplifying assumptions or under choices that cannot scale to even
moderate-size real-world applications. We first revisit a key aspect of the problem,
which is accessing the closest or most relevant users for a given seeker. We describe 
how this can be done on the fly (without any pre-computations) for several possi- 
ble choices - arguably the most natural ones - of proximity computation in a user 
network. Based on this, our top-k algorithm is sound and complete, while address- 
ing the applicability issues of the existing ones. Moreover, it performs significantly 
better and, importantly, it is instance optimal in the case when the search relies ex- 
clusively on the social weight of tagging actions. To further reduce response times, 
we then consider directions for efficiency by approximation. Extensive experiments 
on real world data show that our techniques can drastically improve the response 
time, without sacrificing precision. 

1. Introduction 

Unprecedented volumes of data are now at everyone's fingertips on the World Wide 
Web. The ability to query them efficiently and effectively, by fast retrieval and ranking 
algorithms, has largely contributed to the rapid growth of the Web, making it simply
irreplaceable in our everyday life.

A new dynamic has recently been brought to this development by the social Web,
applications that are centered around users, their relationships and their data. Indeed,
user-generated content is becoming a significant, high-quality portion of the
Web. To illustrate, the most visited Web site today is a social one. This calls for adapted,
efficient retrieval techniques, which can go beyond a classic Web search paradigm where 
data is decoupled from the users querying it. 

An important class of social applications are the collaborative tagging applications, also 
known as social bookmarking applications, with popular examples including Del.icio.us, 
StumbleUpon or Flickr. Their general setting is the following: 

• users form a social network, which may reflect proximity, similarity, friendship,
closeness, etc.,

• items from a public pool of items (e.g., documents, URLs, photos, etc.) are tagged
by users with keywords, for purposes such as description and classification, or to 
facilitate later retrieval, 

• users search for items having certain keywords (i.e., tags) or they are recommended 
items, e.g., based on proximity at the level of tags. 

Collaborative tagging, and social applications in general, can offer an entirely new per- 
spective to how one searches and accesses information. The main reason for this is that 
users can (and often do) play a role at both ends of the information flow, as producers 
and also as seekers of information. Consequently, finding the most relevant items that 
are tagged by some keywords should be done in a network-aware manner. In particular,
items that are tagged by users who are "closer" to the seeker - where the term closer 
depends on model assumptions that will be clarified shortly - should be given more 
weight than items that are tagged by more distant users. 

We consider in this paper the problem of top-k retrieval in collaborative tagging
systems. We investigate it with a focus on efficiency, targeting techniques that have the
potential to scale to current applications on the Web,¹ in an online context where the
social network, the tagging data and even the seekers' search ingredients can change at
any moment. In this context, a key sub-problem for top-k retrieval that we need to
address is computing scores of top-k candidates by iterating not only through the most
relevant items with respect to the query, but also (or mostly) by looking at the closest
users and their tagged items.

We associate with the notion of social network a rather general interpretation, as a user 
graph whose edges are labeled by social scores, which give a measure of the proximity 
or similarity between two users. These are then exploitable in searches, as they say 
how much weight one's tagging actions should have in the result build-up. For example,
even for tagging applications where an explicit social network does not exist or is not
exploitable, one may use the tagging history to build a network based on similarity in
tagging and items of interest. While we focus mainly on bookmarking applications, we
believe that these represent a good abstraction for other types of social applications, to
which our techniques could directly apply.

1 The most popular ones have user bases of the order of millions and huge repositories of data; today's
most accessed social Web application, which also provides tagging and searching functionalities, has
more than half a billion registered users.

Figure 1: A collaborative tagging scenario and its social network.

Example 1. Consider the collaborative tagging configuration of Figure 1. Users have
associated lists of tagged documents and they are interconnected by social links. Each 
link is labeled by its (social) score, assumed to be in the [0, 1] interval. Let us consider 
user Alice in the role of the seeker. The user graph is not complete, as the figure shows, 
and only two users have an explicit social score with respect to Alice. For the remaining 
ones, Danny, . . . , Jim, only an implicit social score could be computed from the existing 
links if a precise measure of their relevance with respect to Alice's queries is necessary
in the top-k retrieval. 

Let us assume that Alice looks for the top two documents that are tagged with both 
news and site. Looking at Alice's immediate neighbors and their respective documents, 
intuitively, D3 should have a higher score than D4, since the former is tagged by a more 
relevant user (Bob, having the maximal social score relative to Alice). If we expand 
the search to the entire graph, the score of D4 may however benefit from the fact that 
other users, such as Eve or even Holly, also tagged it with news or site. Furthermore, 
documents such as D2 and D1 may also be relevant for the top-2 result, even though
they were tagged only by users who are indirectly linked to Alice. 

Under certain assumptions to be clarified shortly, the top-2 documents for Alice's
query will be, in descending score order, D4 and D2. The rest of the paper will present 
the underlying model and algorithms that allow us to build this answer. 
Main related work. Classic top-k retrieval algorithms, such as Fagin's threshold
algorithm [12] and the no random access (NRA) algorithm, rely on precomputed
inverted-index lists with exact scores for each query term (in our setting, a term is
a tag). Revisiting the setting in Figure 1, we would have two per-tag inverted lists,
IL(news) = {D4 : 7, D2 : 2, D1 : 2, D3 : 1, D6 : 1, D5 : 1} and
IL(site) = {D2 : 5, D4 : 2, D3 : 1, D6 : 1, D1 : 1, D5 : 1}, which give the number of times a document has been
tagged with the given tag.
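
As a minimal illustration of how such per-tag inverted lists can be derived from tagging triples, consider the following Python sketch. The function name `build_inverted_lists` and the toy triples are assumptions made for illustration; they are not the data behind Figure 1.

```python
from collections import Counter, defaultdict


def build_inverted_lists(tagged):
    """Build per-tag inverted lists from (user, item, tag) triples.

    Returns {tag: [(item, tf), ...]}, sorted by descending term frequency,
    where tf is the number of users who tagged the item with the tag.
    """
    counts = defaultdict(Counter)
    # A user tags a given item with a given tag at most once, so duplicates
    # are dropped before counting.
    for _user, item, tag in set(tagged):
        counts[tag][item] += 1
    return {
        tag: sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))
        for tag, c in counts.items()
    }


# Hypothetical toy data (not the triples behind Figure 1).
triples = [
    ("bob", "D3", "news"), ("bob", "D3", "site"),
    ("danny", "D4", "news"), ("eve", "D4", "news"),
    ("danny", "D2", "site"),
]
il = build_inverted_lists(triples)
```

Ties are broken by item identifier here only to make the ordering deterministic; the setting above does not prescribe a tie-breaking rule.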

When user proximity is an additional ingredient in the top-k retrieval process, a
direct network-aware adaptation of the threshold algorithm and variants would need
precomputed inverted-index lists for each user-tag pair. For instance, if we interpret
explicit links in the user graph as friendship, ignoring the link scores, and only tagging
by direct friends matters, Alice's lists would be $IL_{Alice}(news)$ = {D4 : 1, D6 : 1} and
$IL_{Alice}(site)$ = {D3 : 1, D6 : 1}. Another 18 such lists would be required and, clearly, this
would have prohibitive space and computing costs in a real-world setting. Amer-Yahia et
al. [1] are the first to address this issue, considering the problem of network-aware search
in collaborative tagging sites, though in a simplified form. The authors consider an
extension to classic top-k retrieval in which user proximity is seen as a binary function
(0-1 proximity): only a subset of the users in the network are selected and can influence
the top-k result. This introduces two strong simplifying restrictions: (i) only documents
tagged by the selected users are relevant in the search, and (ii) all the users thus
selected are equally important. The base solution of [1] is to keep for each tag-item pair,
instead of the detailed lists per user-tag pair, only an upper-bound value on the number
of taggers. For instance, the upper bound for (news, D4) would be 2, since for any user
there are at most two neighbors who tagged D4 with news. This is called the Global
Upper-Bound strategy. A more refined version, which trades space for efficiency, keeps
such upper-bound values within clusters of users, instead of the network as a whole.

Only Schenkel et al. [18] consider the network-aware retrieval problem for collaborative
tagging under a general interpretation, the one we also adopt in this paper.
They consider that even users who are only indirectly connected to the seeker can be
relevant for the top-k result. Their ContextMerge algorithm follows the intuition
that the users closest to the seeker will contribute more to the score of an item, thus
maximizing the chance that the item will remain in the final top-k. The authors describe
a hybrid approach in which, at each step, the algorithm chooses either to look at the
documents tagged by the closest unseen user or at the tag-document inverted lists (a
seeker-agnostic choice). In order to obtain the next (unseen) closest user at any given
step, the algorithm precomputes the proximity value for all possible pairs
of users. These values are then stored in ranked lists (one list per user), and a simple
pointer increment suffices to obtain the next relevant user.

Example 2. Consider the network of Fig. 1. With respect to seeker Alice, the list of
users ranked by proximity would be {Bob : 0.9, Danny : 0.81, Charlie : 0.6, Frank :
0.4, Eve : 0.3, George : 0.2, Holly : 0.1, Ida : 0.1, Jim : 0.05}, with proximity between
two users built as the maximal product of scores over paths linking them (formalized in
Section 2.1).
The main drawbacks of [18] are scalability and applicability. Clearly, precomputing a
weighted transitive closure over the entire network has a high cost in terms of space and
computation in even moderate-size social networks. More importantly, keeping these
proximity lists up to date when they reflect tagging similarity² (as advocated in [18])
would simply be infeasible in real-world settings, which are highly dynamic. (We revisit
these considerations in Section 6.)

Main contributions. We propose an algorithm for top-k answering in collaborative 
tagging, which has the potential to scale to current applications and beyond, in an online 
context where network changes and tagging actions are frequent. For this algorithm, we 
first address a key aspect: accessing efficiently the closest users for a given seeker. We 
describe how this can be done on the fly (without any pre-computations) for a large 
family of functions for proximity computation in a social network, including the most 
natural ones (and the one assumed in [18]). The interest in doing this is threefold:

• we can support full scoring personalization, where each user issuing queries can 
define her own way to rank items, through parameters and score function choices, 

• we can iterate over the relevant users in a more efficient manner, since a typical
network can easily fit in main memory; this can spare the potentially huge disk
volumes required by the algorithm of [18] (see Section 6), while also having the potential
to run faster,

• social link updates are no longer an issue; in particular, when the social network 
depends on the tagging history, we can keep it up-to-date and, by it, all the 
proximity values at any given moment, with little overhead. 

Based on this, our top-k algorithm TOPKS is sound and complete. We show that,
when the search relies exclusively on the social weight of tagging actions, it is instance 
optimal in a large and important class of algorithms. Extensive experiments on real world 
data show that our algorithm performs significantly better than existing techniques, with 
up to 50% improvement (see Section 7).

For further efficiency, we then consider directions for approximate results. Our ap- 
proaches present the advantages of negligible memory consumption (they rely on concise 
statistics about the user network) and reduced computation overhead. Moreover, these 
statistics can be maintained up to date with limited effort, even when the social network 
is built based on tagging history. Experiments show that approximate search techniques 
can drastically improve the response time, reaching around 25% of the running time of 
the exact approach, without sacrificing precision. 

The main focus of our work is on the social aspects of top-k retrieval in collaborative 
tagging applications, and our techniques are designed to perform best in settings where 
tagging actions are mostly (if not exclusively) viewed through the lens of social relevance. 

Outline. The rest of the paper is organized as follows. In Section 2 we formalize
the top-k retrieval problem in collaborative tagging applications. We describe a key
aspect of our approach, the on-the-fly computation of proximity, in Section 2.1. We
then describe our top-k algorithm, first in an exclusively social form, in Section 3, and
show it is instance optimal in Section 3.1. The general algorithm is then presented
in Section 4. Two approaches for improving efficiency by approximation are given in
Section 5. We discuss applicability and scalability issues in Section 6. Experimental
results are presented in Section 7. We overview the related work in Section 8. We
discuss future work and we conclude in Section 9.

2 Tagging similarity may indeed be a more pertinent proximity measure than friendship for top-k search
in bookmarking applications.

2. General Setting 

We consider a social setting in which we have a set of items (could be text documents,
URLs, photos, etc.) $\mathcal{I} = \{i_1, \ldots, i_m\}$, each tagged with one or more distinct tags from
a dictionary of tags $\mathcal{T} = \{t_1, t_2, \ldots, t_l\}$ by one or more users from $\mathcal{U} = \{u_1, \ldots, u_n\}$. We
assume that users form an undirected weighted graph $G = (\mathcal{U}, E, \sigma)$ called the social
network. In $G$, nodes represent users and $\sigma$ is a function that associates to each edge
$e = (u_1, u_2)$ a value in $(0, 1]$, called the proximity (or social) score between $u_1$ and $u_2$.

Given a seeker user $s$, a keyword query $Q = (t_1, \ldots, t_r)$ (a set of $r$ distinct tags) and
an integer value k, the top-k retrieval problem is to compute the (possibly ranked) list of 
the k items having the highest scores with respect to the seeker and query. 

We describe next the score model for this problem. 

Extending the model for social tagging systems presented in [1], we also assume the
following two relations for tags:

• tagging: $Tagged(v, i, t)$ says that a user $v$ tagged the item $i$ with tag $t$,

• tag proximity: $SimTag(t_1, t_2, \lambda)$ says that tags $t_1$ and $t_2$ are similar, with
similarity value $\lambda \in (0, 1)$.

We assume that a user can tag a given item with a given tag at most once. We first
model, for a user, item and tag triple $(s, i, t)$, the score of item $i$ for the given seeker $s$
and tag $t$. This is denoted $score(i \mid s, t)$. Generally,

$$score(i \mid s, t) = h(fr(i \mid s, t)), \qquad (2.1)$$

where $fr(i \mid s, t)$ is the overall term frequency of item $i$ for seeker $s$ and tag $t$, and $h$ is
a positive monotone function.

The overall term frequency function $fr(i \mid s, t)$ is defined as a combination of a
document-dependent component and a network-dependent one, as follows:

$$fr(i \mid s, t) = \alpha \times tf(t, i) + (1 - \alpha) \times sf(i \mid s, t). \qquad (2.2)$$

The former component, $tf(t, i)$, is the term frequency of $t$ in $i$, i.e., the number of times
$i$ was tagged with $t$. The latter component stands for social frequency, a measure that
depends on the seeker.³

3 The linear combination of Eq. (2.2) is one that is widely used when a local retrieval score and a global
one are to be combined, e.g., in spatial search [7] or in social search [18]. However, any monotone
combination of the two score components can be used in these approaches, as in ours.

If we consider that each user brings her own weight (proximity) to the score of an
item, we can define the measure of social frequency as follows:

$$sf(i \mid s, t) = \sum_{v \in \{v \,\mid\, Tagged(v, i, t)\}} \sigma(s, v). \qquad (2.3)$$

Then, given a query $Q$ as a set of tags $(t_1, \ldots, t_r)$, the overall score of $i$ for seeker $s$ and
query $Q$,

$$score(i \mid s, Q) = g(score(i \mid s, t_1), \ldots, score(i \mid s, t_r)),$$

is obtained using a monotone aggregate function $g$ over the individual scores for each
tag. In this paper, the aggregation function $g$ is assumed to be a summation, $g = \sum$.
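
The scoring model can be sketched in a few lines of Python. This is only an illustrative reading of the definitions above: the helper names, the dictionary `sigma` keyed by (seeker, user) pairs, the toy data, and the choice of the identity function for $h$ are all assumptions.

```python
def term_frequency(item, tag, tagged):
    """tf(t, i): number of users who tagged item i with tag t."""
    return sum(1 for (_, i, t) in tagged if i == item and t == tag)


def social_frequency(item, seeker, tag, tagged, sigma):
    """sf(i | s, t): sum of the proximities sigma(s, v) over taggers v."""
    return sum(
        sigma.get((seeker, v), 0.0)
        for (v, i, t) in tagged
        if i == item and t == tag
    )


def score(item, seeker, query, tagged, sigma, alpha, h=lambda x: x):
    """score(i | s, Q) with g = sum; h defaults to the identity (assumption)."""
    total = 0.0
    for tag in query:
        fr = (alpha * term_frequency(item, tag, tagged)
              + (1 - alpha) * social_frequency(item, seeker, tag, tagged, sigma))
        total += h(fr)
    return total


# Hypothetical toy data: dave has no known proximity to alice.
sigma = {("alice", "bob"): 0.9, ("alice", "charlie"): 0.6}
tagged = {
    ("bob", "D3", "news"), ("bob", "D3", "site"),
    ("charlie", "D4", "news"), ("dave", "D4", "news"),
}
query = ["news", "site"]
social_d3 = score("D3", "alice", query, tagged, sigma, alpha=0.0)
social_d4 = score("D4", "alice", query, tagged, sigma, alpha=0.0)
classic_d4 = score("D4", "alice", query, tagged, sigma, alpha=1.0)
```

With these toy values, D3 outranks D4 under the exclusively social setting, while D4 scores higher once only term frequency counts.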

Extended proximity. The above scoring model takes into account only the neighborhood
of the seeker (the users directly connected to her). But this can be extended
to deal also with users that are indirectly connected to the seeker, following a natural
interpretation that user links (e.g., similarity or trust) are (at least to some extent)
transitive. We denote by $\sigma^+$ an extended proximity, which is to be computable from $\sigma$
for any pair of users connected by a path in the network. Now, $\sigma^+$ can replace $\sigma$ in
the definition of social frequency we considered before (Eq. (2.3)), yielding an overall item
scoring scheme that depends on the entire network instead of only the seeker's vicinity.
We discuss shortly possible alternatives for $\sigma^+$ by means of aggregating $\sigma$ values along
paths in the graph. In the rest of this paper, when we talk about proximity we refer to
the extended one.

For a given seeker u, by her proximity vector we denote the list of users with non-zero 
proximity with respect to u, ordered in descending order of these proximity values. 

Remark 1. In Eq. (2.2), the $\alpha$ parameter allows tuning the relative importance of
the social component with respect to classic term frequency. When $\alpha$ is valued 1, the
score becomes network-independent. On the other hand, when $\alpha$ is valued 0, the score
depends exclusively on the social network.

Remark 2. Note that a network in which all the user pairs have a proximity score 
of 1 amounts to the classical document retrieval setting (i.e., the result is independent 
of the user asking the query). 

Remark 3. Tag similarity can be integrated into Eq. (2.3), e.g., by setting a threshold
$\tau$ s.t. if $SimTag(t, t', \lambda)$, with $\lambda$ above $\tau$, and $Tagged(v, i, t')$, we also add $\sigma(u, v)$ to
$sf(i \mid u, t)$. For the sake of simplicity this is ignored in this paper, but remains an
integral part of the model.

Remark 4. Note that queries are not assumed to use only tags from $\mathcal{T}$. For any tag
outside this dictionary, items will obviously have a score of 0. 

2.1. Computing $\sigma^+$

We describe in this section a key aspect of our algorithm for top-k search, namely
on-the-fly computation of proximity values with respect to a seeker $s$. The issue here is
to facilitate at any given step the retrieval of the most relevant unseen user $u$ in the
network, along with her proximity value $\sigma^+(s, u)$. This user will have the potential to
contribute the most to the partial scores of items that are still candidates for the top-k
result, by Eq. (2.2) and (2.3).

We start by discussing possible candidates for $\sigma^+$, arguably the most natural ones,
drawing inspiration from studies in the area of trust propagation for belief statements. 
We then give a wider characterization for the family of possible functions for proximity 
computation, to which these candidates belong. 

Candidate 1 ($f_{mul}$). Experiments on trust propagation in the Epinions network (for
computing a final belief in a statement) [17] or in P2P networks show that (i) multiplying
the weights on a given path between $u$ and $v$, and (ii) choosing the maximum value over
all the possible paths, gives the best results (measured in terms of precision and recall)
for predicting beliefs. We can integrate this into our scenario, by assuming that belief
refers to tagging with a tag $t$. We thus aggregate the weights on a path $p = (u_1, \ldots, u_l)$
(with a slight abuse of notation) as

$$\sigma^+(p) = \prod_i \sigma(u_i, u_{i+1}).$$

For seeker Alice in our running example, we gave in the previous section (Example 2)
the proximity values and the ordering of the network under this candidate for $\sigma^+$.

Candidate 2 ($f_{min}$). A possible drawback of Candidate 1 for proximity aggregation is
that values may decrease quite rapidly. A $\sigma^+$ function that avoids this could be obtained
by replacing multiplication over a path with the minimum, as follows:

$$\sigma^+(p) = \min_i \{\sigma(u_i, u_{i+1})\}.$$

Under this $\sigma^+$ candidate, the values with respect to seeker Alice would be the following:
{Bob : 0.9, Danny : 0.9, Charlie : 0.6, Frank : 0.6, Eve : 0.5, George : 0.5, Harry :
0.5, Ida : 0.25, Jim : 0.25}.

Candidate 3 ($f_{pow}$). Another possible definition for $\sigma^+$ we consider relies on an aggregation
that penalizes long paths, i.e., distant users, in a controllable way, as follows:

$$\sigma^+(p) = \lambda^{-\sum_i \frac{1}{\sigma(u_i, u_{i+1})}},$$

where $\lambda > 1$ can be seen as a "drop parameter"; the greater its value, the more rapid
the decrease of proximity values. Under this candidate for $\sigma^+$, for $\lambda = 2$, the rounded
values w.r.t. seeker Alice would be {Bob : 0.46, Charlie : 0.31, Danny : 0.21, Eve :
0.077, Frank : 0.0525, George : 0.013, Ida : 0.003, Harry : 0.003, Jim : 0.0007}.
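
The three candidates can be compared on a single path with a short Python sketch. The function names mirror the candidate labels, and the power-based form is read as $\lambda^{-\sum_i 1/\sigma(u_i, u_{i+1})}$, a reading that reproduces the rounded values listed above for Bob, Charlie and Danny. The edge weights used here (0.9 for Alice-Bob, 0.6 for Alice-Charlie, 0.9 for Bob-Danny) are inferred from the values reported in the examples, not read off Figure 1.

```python
import math


def f_mul(weights):
    """Candidate 1: product of the edge scores along a path."""
    return math.prod(weights)


def f_min(weights):
    """Candidate 2: minimum edge score along a path."""
    return min(weights)


def f_pow(weights, lam=2.0):
    """Candidate 3: lam ** (-sum of 1/sigma over the path's edges)."""
    return lam ** (-sum(1.0 / w for w in weights))


# Edge weights inferred from the examples (an assumption).
alice_bob = [0.9]
alice_charlie = [0.6]
alice_bob_danny = [0.9, 0.9]

danny_mul = f_mul(alice_bob_danny)
danny_min = f_min(alice_bob_danny)
bob_pow = f_pow(alice_bob)
charlie_pow = f_pow(alice_charlie)
danny_pow = f_pow(alice_bob_danny)
```

All three functions can only keep or lower the value when an edge with score in (0, 1] is appended to a path.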

The key common feature of the candidate functions previously discussed is that they
are monotonically decreasing over any path they are applied to, when $\sigma$ draws values
from the interval $[0, 1]$. More formally, they verify the following property:

Property 1. Given a social network $G$ and a path $p = (u_1, \ldots, u_l)$ in $G$, we have
$\sigma^+(u_1, \ldots, u_l) \le \sigma^+(u_1, \ldots, u_{l-1})$.
We then define $\sigma^+$ for any pair of users $(s, u)$ who are connected in the network by
taking the maximal weight over all their connecting paths. More formally, we define
$\sigma^+(s, u)$ as

$$\sigma^+(s, u) = \max_p \{\sigma^+(p) \mid s \overset{p}{\leadsto} u\}. \qquad (2.4)$$

Note that when the first candidate (multiplication) is used, we obtain the same aggregation
scheme as in [17], which is also employed in [18] in the context of top-k network-aware
search.

Example 3. In our running example, if we use multiplication in Eq. (2.4), for the seeker
Alice, for $\alpha = 0$ (hence exclusively social relevance), by Eq. (2.3) we obtain the following
values for social frequency: $SF_{Alice}(news)$ = {D4 : 2.6, D2 : 1.01, D1 : 0.7, D6 : 0.6, D3 :
0.1, D5 : 0.05} and $SF_{Alice}(site)$ = {D4 : 1.11, D2 : 1.1, D3 : 0.9, D6 : 0.6, D1 : 0.05, D5 :
0.05}.

We argue next that a greedy approach is applicable to all aggregation definitions that
satisfy Property 1 and apply Eq. (2.4). This will allow us to browse the network of
users on the fly, at query time, visiting them in the order of their proximity with respect
to the seeker.

More precisely, by generalizing Dijkstra's algorithm [10], we will maintain a
max-priority queue, denoted $H$, whose top element $top(H)$ will be at any moment the most
relevant unvisited user.⁴ A user is visited when her tagged items are taken into account
for the top-k result, as described in the following sections (this can occur at most once).
At each step advancing in the network, the top of the queue is extracted (visited) and
its unvisited neighbours (adjacent nodes) are added to the queue (if not already present)
and are relaxed. Let $\otimes$ denote the aggregation function over a path (one that satisfies
Property 1). Relaxation updates the best proximity score of these nodes, as described
in Algorithm 1.



Algorithm 1: Relaxation

if $\sigma^+(s, u) \otimes \sigma(u, v) > \sigma^+(s, v)$ then
    $\sigma^+(s, v) = \sigma^+(s, u) \otimes \sigma(u, v)$
end if



It can be shown by straightforward induction that this greedy approach allows us to 
visit the nodes of the network in decreasing order of their proximity with respect to the 
seeker, under any function for proximity aggregation that satisfies Property 1.
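
A minimal Python sketch of this greedy traversal is given below, assuming the multiplicative aggregation of Candidate 1. The function name `users_by_proximity`, the adjacency-dictionary representation, the starting value 1.0 for the seeker (the neutral element of multiplication) and the small graph are illustrative assumptions; the graph is hypothetical and is not the network of Figure 1. Since Python's `heapq` is a min-heap, proximities are negated to obtain max-priority behaviour, and stale queue entries are skipped instead of being updated in place.

```python
import heapq


def users_by_proximity(graph, seeker, combine=lambda a, b: a * b, start=1.0):
    """Yield (user, proximity) pairs in non-increasing order of proximity.

    graph: {user: {neighbour: sigma}} with sigma in (0, 1].
    combine: path aggregation satisfying Property 1 (default: product).
    """
    best = {seeker: start}
    visited = set()
    heap = [(-start, seeker)]
    while heap:
        neg_prox, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale entry: u was already extracted with a better value
        visited.add(u)
        if u != seeker:
            yield u, -neg_prox
        for v, weight in graph.get(u, {}).items():
            if v in visited:
                continue
            candidate = combine(best[u], weight)
            # Relaxation step (Algorithm 1).
            if candidate > best.get(v, float("-inf")):
                best[v] = candidate
                heapq.heappush(heap, (-candidate, v))


# Hypothetical undirected network.
graph = {
    "alice": {"bob": 0.9, "charlie": 0.6},
    "bob": {"alice": 0.9, "danny": 0.9},
    "charlie": {"alice": 0.6, "danny": 0.5},
    "danny": {"bob": 0.9, "charlie": 0.5, "eve": 0.5},
    "eve": {"danny": 0.5},
}
order = list(users_by_proximity(graph, "alice"))
```

Because the function is a generator, a caller can stop pulling users as soon as a termination condition is met, without exploring the rest of the network.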

We describe in the following section and in Section 4 how this greedy procedure for
iterating over the network is used in our top-k social retrieval algorithm. Without loss
of generality, in the rest of the paper, consistent with social theories and with previous 
work on social top-k search, proximity will be based on Candidate 1 (multiplication). 



4 Dijkstra's classic algorithm [10] computes single-source shortest paths in a weighted graph without
negative edges. 
3. Top-k Algorithm for $\alpha = 0$



As the main focus of this paper is on the social aspects of search in tagging systems, we
detail first our top-k algorithm, TOPKS, for the special case when the parameter $\alpha$ is
0. In this case, $fr(i \mid s, t)$ is simplified as

$$fr(i \mid s, t) = sf(i \mid s, t).$$

For each user u and tag t, we assume a precomputed projection over the Tagged relation 
for them, giving the items tagged by u with t; we call these the user lists. No particular 
order is assumed for the items appearing in a user list. 

We keep a list D of top-k candidate items, sorted in descending order by their minimal 
possible scores (to be defined shortly). An item becomes candidate when it is met for 
the first time in a Tagged triple. 

As usual, we assume that, for each tag t, we have an inverted list IL(t) giving the 
items i tagged by it, along with their term frequencies tf(t, i)⁵, in descending order of
these frequencies. Starting from the topmost item, these lists will be consumed one item 
at a time, whenever the current item becomes candidate for the top-k result. By CIL(t) 
we denote the items already consumed (as known candidates), by top_item(t) we denote 
the item present at the current (unconsumed) position of IL(t), and we use top_tf(t) 
as short notation for the term frequency associated with this item. 

We detail mostly the computation of social frequency, $sf(i \mid u, t)$, as it is the key
parameter in the scoring function of items. Since when $\alpha = 0$ we do not use metrics that
are tag-only dependent, it is not necessary to treat each tag of the query as a distinct
dimension and to visit each in round-robin style (as done in the threshold algorithm or 
in CONTEXTMERGE). It suffices for our purposes to get at each step, for the currently 
visited user, all the items that were tagged by her with query terms (one user list for 
each term). 

For each tag $t_j \in Q$, by $unseen\_users(i, t_j)$ we denote the maximal number of yet
unvisited users who may have tagged item $i$ with $t_j$. This is initially set to the maximal
possible term frequency of $t_j$ over all items (a value that is available at the current position
of the inverted list $IL(t_j)$, as $top\_tf(t_j)$).

Each time we visit a user $u$ who tagged item $i$ with $t_j$ we can (a) update $sf(i \mid s, t_j)$
(initially set to 0) by adding $\sigma^+(s, u)$ to it, and (b) decrement $unseen\_users(i, t_j)$.

When $unseen\_users(i, t_j)$ reaches 0, the social frequency value $sf(i \mid s, t_j)$ is final.
This also gives us a possible termination condition, as discussed in the following.

At any moment in the run of the algorithm, the optimistic score $MaxScore(i \mid s, Q)$
of an item $i$ that has already been seen in some user list will be estimated using as social
frequency for each tag $t_j$ of the query the following value:

$$top(H) \times unseen\_users(i, t_j) + sf(i \mid s, t_j).$$



5 In TOPKS, even though the social frequency does not depend on tf scores, we will exploit the 
inverted lists and the tf scores by which they are ordered, to better estimate score bounds. In 
particular, as detailed later, this allows us to achieve instance optimality. 
Symmetrically, the pessimistic overall score, $MinScore(i \mid s, Q)$, is estimated by the
assumption that, for each tag $t_j$, the current social frequency $sf(i \mid s, t_j)$ will be the
final one. The list of candidates $D$ is sorted in descending order by this lowest possible
score.

An upper-bound score on the yet unseen items, $MaxScoreUnseen$, is estimated
using as social frequency for each tag $t_j$ the value $top(H) \times top\_tf(t_j)$.

When the maximal optimistic score of items that are already in $D$ but not in its top-k,
as well as $MaxScoreUnseen$, is less than the pessimistic score of the last element in the
current top-k of $D$ (i.e., $D[k]$),
the run of the algorithm can terminate, as we are guaranteed that the top-k can no
longer change. (Note however that at this point the top-k items may have only partial
scores and, if a ranked answer is needed, the process of visiting users should continue.)

We present the flow of TOPKS in Algorithm 2. Key differences with respect to
CONTEXTMERGE's social branch are (i) the on-the-fly computation of proximity values, 
in lines 1-7 and 29-31 of the algorithm, and (ii) the consuming of inverted list positions, 
when they become candidates, in lines 20-28. For clarity, we first exemplify a TOPKS 
run without the latter aspect (this would correspond to a CONTEXTMERGE run). 

Example 4. Revisiting Example 1, recall that we want to compute the top-2 items for
the query $Q$ = {news, site} from Alice's point of view. To simplify, let us assume that
$score(i \mid u, t) = sf(i \mid u, t)$ and $g$ is addition. We consider next how the algorithm
described above runs.

At the first iteration of the line 8 loop in the algorithm, we visit Bob's user lists, adding
D3 to the candidate buffer. At the second iteration, we visit Danny's user lists, adding
D2 and D4 to the candidate buffer. At the third iteration (Charlie's user lists) we add
D6 to the candidate list. D1 is added to the candidate list when the algorithm visits
Frank's user lists, at iteration 4. Recall that top_tf(news) = 7 and top_tf(site) = 5.

The 6th iteration of the algorithm is the final one, visiting George's user lists, finding
D2 tagged with news, site and D4 tagged with site. D4 and D2 are the top-2 candidates,
with MinScore(D4, Q) = 2.61 and MinScore(D2, Q) = 2.21. The closest candidate is
D6, with MinScore(D6, Q) = 1.2 and MaxScore(D6, Q) = 1.2 + 6 × 0.1 + 4 × 0.1 = 2.2.
Also, MaxScoreUnseen(Q) = 7 × 0.1 + 5 × 0.1 = 1.2. Finally, since MaxScore(D6, Q) <
MinScore(D2, Q) and MaxScoreUnseen(Q) < MinScore(D2, Q),
the algorithm stops, returning D4 and D2 as the top-2 items.
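
The bound computations at the end of Example 4 can be replayed with a short Python sketch. The function names are illustrative; the per-tag split of D6's partial score (0.6 for news and 0.6 for site) is taken from the social frequency values of Example 3, and 0.1 is the proximity of the next unvisited user as used in the example.

```python
def min_score(sf):
    """Pessimistic score: the current social frequencies are assumed final."""
    return sum(sf.values())


def max_score(sf, unseen_users, top_h):
    """Optimistic score: each possibly unseen tagger contributes at most top_h."""
    return sum(sf[t] + top_h * unseen_users[t] for t in sf)


def max_score_unseen(top_tf, top_h):
    """Upper bound on the score of any item not yet met in a user list."""
    return sum(top_h * tf for tf in top_tf.values())


top_h = 0.1                              # proximity of the next unvisited user
sf_d6 = {"news": 0.6, "site": 0.6}       # partial social frequencies of D6
unseen_d6 = {"news": 6, "site": 4}       # 7 - 1 and 5 - 1
top_tf = {"news": 7, "site": 5}
min_score_d2 = 2.21                      # MinScore of D[2] in the example

d6_min = min_score(sf_d6)
d6_max = max_score(sf_d6, unseen_d6, top_h)
unseen_bound = max_score_unseen(top_tf, top_h)
can_stop = d6_max < min_score_d2 and unseen_bound < min_score_d2
```

Both bounds fall below MinScore(D2, Q), which is the termination condition described above.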

We discuss next the benefit of consuming inverted-list positions when these become
candidates (illustrated in Example [3]). In lines 20-28, we aim at keeping to a minimum 
the worst-case estimation of the number of unseen taggers. More precisely, we test
whether there are top-k candidates $i$ (i.e., items already seen in user lists) for which the
term frequency for some tag $t_j$ of $Q$, $tf(t_j, i)$, is "within reach", being the one currently used
(from $IL(t_j)$) as the basis for the optimistic (maximal) estimate of the number of yet
unseen users who tagged candidate items with $t_j$. When such a pair $(i, t_j)$ is found, we
can make the following adjustments:

1. refine the number of unseen users who tagged $i$ with $t_j$ from a (possibly loose)
estimate to its exact value; this is marked when $i$ is added to the $CIL$ list of $t_j$
(line 27), and from this point on the number of unseen users will only change when
new users who tagged $i$ with $t_j$ are found (line 18).

2. advance (at the cost of a sequential access) beyond $i$ in the inverted list of $t_j$, to the
next best item; this allows us to refine (at line 25) the estimates $unseen\_users(i', t_j)$
for all candidates $i'$ for which the exact number of users who tagged with $t_j$ is yet
unknown.

Algorithm 2: TOPKS$_{\alpha=0}$: top-$k$ algorithm for $\alpha = 0$

Require: seeker $s$, query $Q = (t_1, \dots, t_r)$
 1: for all users $u$, tags $t_j \in Q$, items $i$ do
 2:     $\sigma^+(s,u) = -\infty$
 3:     $sf(i \mid s,t_j) = 0$
 4:     set $IL(t_j)$ position on first entry; $CIL(t_j) = \emptyset$
 5: end for
 6: $\sigma^+(s,s) = 0$; $D = \emptyset$ (candidate items)
 7: $H \leftarrow$ max-priority queue of nodes $u$ (sorted by $\sigma^+(s,u)$), initialized with $\{s\}$
 8: while $H \neq \emptyset$ do
 9:     $u$ = EXTRACT_MAX($H$)
10:     for all tags $t_j \in Q$, triples $Tagged(u,i,t_j)$ do
11:         $sf(i \mid s,t_j) \leftarrow sf(i \mid s,t_j) + \sigma^+(s,u)$
12:         if $i \notin D$ then
13:             add $i$ to $D$
14:             for all tags $t_l \in Q$ do
15:                 $unseen\_users(i,t_l) \leftarrow top\_tf(t_l)$ (initialization)
16:             end for
17:         end if
18:         $unseen\_users(i,t_j) \leftarrow unseen\_users(i,t_j) - 1$
19:     end for
20:     while $\exists t_j \in Q$ s.t. $i = top\_item(t_j) \in D$ do
21:         $tf(t_j,i) \leftarrow top\_tf(t_j)$ ($t_j$'s frequency in $i$ is now known)
22:         advance $IL(t_j)$ one position
23:         $\Delta \leftarrow tf(t_j,i) - top\_tf(t_j)$ (the $top\_tf$ drop)
24:         for all items $i' \in D \setminus CIL(t_j)$ do
25:             $unseen\_users(i',t_j) \leftarrow unseen\_users(i',t_j) - \Delta$
26:         end for
27:         add $i$ to $CIL(t_j)$
28:     end while
29:     for all users $v$ s.t. $\sigma(u,v) \in E$ do
30:         RELAX($u$,$v$)
31:     end for
32:     if $MinScore(D[k],Q) > \max_{l>k}(MaxScore(D[l],Q))$ AND $MinScore(D[k],Q) > MaxScoreUnseen$ then
33:         break
34:     end if
35: end while
36: return $D[1], \dots, D[k]$

(We found in the experimental evaluation (Section 7) that this aspect has the potential
to drastically reduce the cost of the search. Since tf-values in inverted lists fall quite
rapidly in most practical settings, we witnessed significant cost savings, while using
relatively few such list position increments.)

Example 5. Let us now consider how the choice of advancing in the inverted lists
when possible influences the number of needed iterations. At first, $top\_tf(news) = 7$,
$top\_item(news) = D4$, and $top\_tf(site) = 5$, $top\_item(site) = D2$.

The first iteration only introduces D3 and thus we cannot advance in either of the two
inverted lists. However, the discovery of D2 and D4 in step 2 allows us to fix their
exact tf values and advance in the inverted lists. The new positions are: $top\_tf(news) = 2$,
$top\_item(news) = D1$, and $top\_tf(site) = 1$, $top\_item(site) = D6$. The discovery of D6
in iteration 3 allows us to advance further in the inverted lists. Finally, in
step 4, the discovery of D1 allows the algorithm to advance in the inverted lists to
$top\_tf(news) = 1$, $top\_item(news) = D5$, and $top\_tf(site) = 1$, $top\_item(site) = D5$
(the only undiscovered item). This allows for some drastic score estimation refinements.
We have the same top-2 candidates, D4 and D2, with $MinScore(D4, Q) = 1.81$ and
$MinScore(D2, Q) = 1.21$. The closest item is again D6, with $MinScore(D6, Q) =
MaxScore(D6, Q) = 1.2$, since we know that we have visited all users who tagged D6.
$MaxScoreUnseen(Q) = 1 \times 0.3 + 1 \times 0.3 = 0.6$, since the maximal unseen document,
D5, is tagged only once with each tag. $MaxScore(D6, Q) < MinScore(D2, Q)$
and $MaxScoreUnseen(Q) < MinScore(D2, Q)$ allow us to exit the loop two steps
before the unrefined algorithm, returning the exact top-2: D4 and D2.
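
The refinement step of lines 20-28 of Algorithm 2 can be sketched for a single tag as below. This is an illustrative Python sketch under stated assumptions: the function name, the data layout and the sample data are hypothetical, and the item whose frequency has just been fixed is excluded from the decrement (its count is already exact at that point), which is how we read item 1 of the list above.

```python
def consume_inverted_list(il, pos, candidates, cil, unseen, tf):
    """Advance in one tag's inverted list while its top item is a candidate.

    il         : list of (item, tf) pairs sorted by decreasing tf
    pos        : current position in il
    candidates : set of items already seen in user lists (the buffer D)
    cil        : set of items whose tf for this tag is known exactly (CIL)
    unseen     : dict item -> optimistic number of unseen taggers
    tf         : dict item -> exact tf, filled in as values become known
    Returns the new position in il.
    """
    while pos < len(il) and il[pos][0] in candidates:
        item, top_tf = il[pos]
        tf[item] = top_tf                      # exact frequency is now known
        cil.add(item)                          # its unseen count is exact
        pos += 1                               # one sequential access
        next_top = il[pos][1] if pos < len(il) else 0
        drop = top_tf - next_top               # the top_tf drop
        for other in candidates - cil:
            unseen[other] -= drop              # tighten the loose estimates
    return pos


# Hypothetical data: D4 is the top item for the tag and is already a candidate.
il = [("D4", 7), ("D1", 2), ("D5", 1)]
candidates = {"D4", "D3"}
cil, tf = set(), {}
unseen = {"D4": 6, "D3": 7}
pos = consume_inverted_list(il, 0, candidates, cil, unseen, tf)
print(pos, unseen, tf)
```

In this run the list advances by one position, and the estimate for D3 drops from 7 to 2 because no item still in the list is tagged more than twice.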

We can prove the following property of our algorithm: 

Property 2. For a given seeker $s$, TOPKS$_{\alpha=0}$ visits the network in decreasing order
of the $\sigma^+$ values with respect to $s$.

As a corollary of Property 2, we have that TOPKS$_{\alpha=0}$ visits users who may be relevant
for the query in the same order as CONTEXTMERGE [18]. More importantly, we prove
in Section 3.1 that our algorithm visits as few users as possible, i.e., it is instance
optimal with respect to this aspect. Moreover, the experiments show that TOPKS can
drastically reduce the number of visited user lists in practice (see Section 7).
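
The visiting order stated in Property 2 can be illustrated with a Dijkstra-style traversal. The Python sketch below is only an illustration under assumptions: it uses multiplication over paths as the proximity aggregation, assigns the seeker proximity 1, and relies on a binary heap with lazy deletion (`heapq`) rather than the max-priority queue with key increases used in Algorithm 2. The function name and graph layout are hypothetical.

```python
import heapq


def users_by_proximity(graph, seeker):
    """Yield (user, proximity) pairs in decreasing proximity to the seeker.

    graph: dict user -> dict neighbor -> edge weight in (0, 1]
    The proximity of a path is the product of its edge weights, and the
    extended proximity of a user is the best value over all paths.
    """
    best = {seeker: 1.0}
    heap = [(-1.0, seeker)]           # heapq is a min-heap, so negate keys
    visited = set()
    while heap:
        neg_prox, u = heapq.heappop(heap)
        if u in visited:
            continue                   # stale entry left by an earlier push
        visited.add(u)
        yield u, -neg_prox
        for v, w in graph.get(u, {}).items():
            cand = -neg_prox * w       # relaxation step
            if v not in visited and cand > best.get(v, 0.0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))


graph = {"alice": {"bob": 0.9, "carol": 0.5}, "bob": {"carol": 0.8}, "carol": {}}
for user, prox in users_by_proximity(graph, "alice"):
    print(user, round(prox, 2))
```

Because edge weights never exceed 1, extending a path cannot increase its proximity, which is what makes the greedy extraction order correct.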

3.1. Instance Optimality of TOPKS$_{\alpha=0}$

We will use the same definition of instance optimality as in [12]. For a class of algorithms
$\mathbf{A}$ and a class of legal inputs (instances) $\mathbf{D}$, $cost(A, \mathcal{D})$ denotes the cost of running algorithm
$A \in \mathbf{A}$ on input $\mathcal{D} \in \mathbf{D}$. An algorithm $A$ is said to be instance optimal for its class $\mathbf{A}$
over inputs $\mathbf{D}$ if for every $B \in \mathbf{A}$ and every $\mathcal{D} \in \mathbf{D}$ we have $cost(A, \mathcal{D}) = O(cost(B, \mathcal{D}))$.

Let $c_{UL}$ be the abstract cost of accessing a user list - a process which involves the
relatively costly operations of finding the proximity value of the user and retrieving the
items tagged by the user with query terms - and let $users(A, \mathcal{D})$ be the total number of
user lists needed for establishing the top-$k$ for algorithm $A$ on input $\mathcal{D}$. Let $c_S$ be
the abstract cost of sequentially accessing the data in $IL(t)$, and let $seqitems(A, \mathcal{D})$ be
the total number of sequential accesses to $IL$ for algorithm $A$ on input $\mathcal{D}$. In practice,
$c_{UL} \gg c_S$ is a reasonable assumption; hence, for two algorithms $A$ and $B$, we have

$$\frac{users(A, \mathcal{D}) \times c_{UL} + seqitems(A, \mathcal{D}) \times c_S}{users(B, \mathcal{D}) \times c_{UL} + seqitems(B, \mathcal{D}) \times c_S} \approx \frac{users(A, \mathcal{D})}{users(B, \mathcal{D})}.$$

Therefore, for a fair cost estimate in practical social search settings, a reasonable
assumption is to consider

$$cost(A, \mathcal{D}) = users(A, \mathcal{D}).$$

Let us now define the class of "social" algorithms $\mathbf{S}$ to which both TOPKS$_{\alpha=0}$ and
CONTEXTMERGE (when $\alpha = 0$) belong. These algorithms correctly return the top-$k$
items for a given query $Q$ and seeker $s$, they do not use random accesses to $IL(t)$ indexes
in order to fetch a certain tf value, and they do not include in their working buffers
(e.g., candidate buffer $D$) items that were not yet encountered in the user lists. The last
assumption could be seen as a "no wild guess" policy, by which the algorithm cannot
guess that an item might be encountered at some later stage. This is a reasonable
assumption in practice, as the number of items needed for computing a top-$k$ result for
a given seeker should in general be much smaller than the total number of items tagged
with query terms.

The class $\mathbf{D}$ of accepted inputs consists of the inputs that respect the setting described
in Section 2.

Theorem 1. TOPKS$_{\alpha=0}$ is instance optimal over $\mathbf{S}$ and $\mathbf{D}$, when the cost is defined as
$cost(A, \mathcal{D}) = users(A, \mathcal{D})$.

The optimality proof is given in Appendix A.

4. Algorithm for The General Case 

For the general case, in which $\alpha \in [0, 1]$, we adapt the CONTEXTMERGE [18] algorithm
to include the on-the-fly processing of user proximities.

At each iteration, the algorithm can alternate, by calling ChooseBranch(), between
two possible execution branches: the social branch (lines 8-31 of Algorithm 2) and the
textual branch, which is a direct adaptation of NRA.

As in the exclusively social setting of the previous section, we will read term frequency
scores $tf(t_j, i)$ from the inverted lists, on a per-need basis, either as in line 21 of
TOPKS$_{\alpha=0}$, or when advancing on the textual branch. Initially, all unknown tf-scores
are assumed to be set to 0.






The optimistic overall score $MaxScore(i, Q)$ of an item $i$ that is already in the
candidate list $D$ will now be computed by setting $fr(i \mid s,t)$, defined in Eq. (2.2), to

$$fr(i \mid s,t) = (1-\alpha) \times top(H) \times unseen\_users(i,t) + (1-\alpha) \times sf(i \mid s,t) + \alpha \times \max(tf(t,i),\, top\_tf(t)).$$

The last term accounts for the textual weight of the score, and uses either the exact 
term frequency (if known), or an upper-bound for it (the score in the current position 
of IL(t)). 

Symmetrically, for the pessimistic overall score $MinScore(i, Q)$, the frequency $fr(i \mid s,t)$
will be computed as

$$fr(i \mid s,t) = (1-\alpha) \times sf(i \mid s,t) + \alpha \times \max(tf(t,i),\, partial\_tf(t,i)),$$

where $partial\_tf(t,i)$ represents the count of visited users who tagged $i$ with $t$, which is
used as a lower bound for $tf(t,i)$ when this is not yet known.

The upper bound for the score of the yet unseen items, MaxScoreUnseen, is estimated
using as overall frequency for each tag $t$ the following value:

$$fr(i \mid s,t) = \alpha \times top\_tf(t) + (1-\alpha) \times top(H) \times top\_tf(t).$$
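
The three per-tag bounds above translate directly into code. The following Python sketch is only an illustration; the function and parameter names are assumptions, and unknown term frequencies are passed as 0, following the convention stated earlier in this section.

```python
def fr_max(alpha, top_h, unseen_users, sf, tf, top_tf):
    """Optimistic frequency used by MaxScore for a candidate item."""
    social = (1 - alpha) * top_h * unseen_users + (1 - alpha) * sf
    textual = alpha * max(tf, top_tf)      # exact tf if known, else upper bound
    return social + textual


def fr_min(alpha, sf, tf, partial_tf):
    """Pessimistic frequency used by MinScore for a candidate item."""
    return (1 - alpha) * sf + alpha * max(tf, partial_tf)


def fr_unseen(alpha, top_h, top_tf):
    """Optimistic frequency of any item not yet seen (MaxScoreUnseen)."""
    return alpha * top_tf + (1 - alpha) * top_h * top_tf


# With alpha = 0 the unseen bound reduces to top(H) x top_tf(t), as in Example 4.
print(fr_unseen(0.0, 0.1, 7))
```

For any consistent state the pessimistic value never exceeds the optimistic one, since `partial_tf` is at most the true term frequency and the latter is at most `max(tf, top_tf)`.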

We present the flow of the general case algorithm in Algorithm 3. Method Initialize()
amounts to lines 1-6 of TOPKS$_{\alpha=0}$, and method ProcessSocial() amounts to lines
8-31 of TOPKS$_{\alpha=0}$ (modulo the straightforward adjustment for the count $partial\_tf$).

The difference between the $\alpha = 0$ case and the general case is the processing of the
inverted lists (textual branch), which is done as in the NRA algorithm (see lines 7-13 of
Algorithm 3). We discuss how the choice of the branch to be followed is made, by the
ChooseBranch() subroutine, in Section 4.1.

4.1. Choosing between the social and textual branches 

The TOPKS$_{\alpha=0}$ algorithm, in which only the social branch matters, is instance optimal
(see Theorem 1), with the cost being estimated as $users(\mathrm{TOPKS}_{\alpha=0}, \mathcal{D})$. Like the NRA
algorithm [12], when only the textual branch matters, TOPKS$_{\alpha=1}$ is instance optimal,
with the cost being estimated as $seqitems(\mathrm{TOPKS}_{\alpha=1}, \mathcal{D})$.

When $\alpha$ is not one of the extreme values, under a cost function that combines
the two above, of the form

$$users(\mathrm{TOPKS}_{\alpha=0}, \mathcal{D}) \times c_{UL} + seqitems(\mathrm{TOPKS}_{\alpha=1}, \mathcal{D}) \times c_S,$$

a key role for efficiency is played by ChooseBranch().

In [18], the choice between the textual branch and the social one was made by estimating
the maximum potential score of each, in round-robin manner over the query dimensions.
For a query tag $t_j$, the maximal contribution of the social branch would be estimated as
$MaxSocial(t_j) = (1-\alpha) \times max\_tf(t_j) \times top(H)$, where $max\_tf(t_j)$ is the maximum
tf for $t_j$ (i.e., the number of taggers for the item that has been tagged the most with $t_j$).
For the textual part, the maximal potential contribution would be estimated by setting
$MaxTextual(t_j) = \alpha \times top\_tf(t_j)$. Then, if $MaxSocial(t_j) > MaxTextual(t_j)$,
the social branch was chosen; otherwise the textual branch was chosen.

Algorithm 3: TOPKS: top-$k$ algorithm for the general case

Require: seeker $s$, query $Q = (t_1, \dots, t_r)$
 1: Initialize()
 2: while $H \neq \emptyset$ do
 3:     ChooseBranch()
 4:     if social branch then
 5:         ProcessSocial()
 6:     else
 7:         for all tags $t_j \in Q$, item $i = top\_item(t_j)$ do
 8:             if $i \notin D$ then
 9:                 add $i$ to $D$ and $CIL(t_j)$
10:             end if
11:             $tf(t_j,i) \leftarrow top\_tf(t_j)$
12:             advance $IL(t_j)$ one position
13:         end for
14:     end if
15:     if $MinScore(D[k],Q) > \max_{l>k}(MaxScore(D[l],Q))$ and $MinScore(D[k],Q) > MaxScoreUnseen$ then
16:         break
17:     end if
18: end while
19: return $D[1], \dots, D[k]$

We use a different heuristic for the branch choice. At any point in the run of TOPKS,
unless termination is reached, we have at least one item $r$ with $MaxScore(r, Q) \geq
MinScore(D[k], Q)$. We consider the item $r = D[\arg\max_{l>k} MaxScore(D[l], Q)]$,
which has the highest potential score, and we choose the branch that is the most likely
to refine $r$'s score (put otherwise, the branch that counts the most in the MaxScore
estimation for $r$). The intuition behind this branch choice mechanism is that it is more
likely to advance the run of the algorithm closer to termination.

For each tag $t_j \in Q$, we set $MaxTextual(t_j)$ to $\alpha \times top\_tf(t_j)$ if the term frequency
$tf(t_j, r)$ is not yet known, or to 0 otherwise. For the social part of the score, we set

$$MaxSocial(t_j) = (1-\alpha) \times unseen\_users(r, t_j) \times top(H).$$

Then, we follow the social branch if, for at least one of the tags, MaxSocial is greater
than MaxTextual.
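
The branch choice described above can be sketched as follows. This Python fragment is a hedged illustration only: the function name, the string return values and the dictionary arguments are assumptions, and the selection of the item $r$ is left to the caller.

```python
def choose_branch(alpha, top_h, query_tags, unseen_users_r, tf_known_r, top_tf):
    """Pick the branch most likely to refine the score of item r.

    unseen_users_r: dict tag -> unseen_users(r, tag)
    tf_known_r    : dict tag -> True if tf(tag, r) is already known
    top_tf        : dict tag -> tf at the current inverted-list position
    """
    for t in query_tags:
        max_textual = 0.0 if tf_known_r[t] else alpha * top_tf[t]
        max_social = (1 - alpha) * unseen_users_r[t] * top_h
        if max_social > max_textual:
            return "social"            # one tag is enough to go social
    return "textual"


print(choose_branch(0.2, 0.5, ["news"], {"news": 4}, {"news": False}, {"news": 7}))
```

In the sample call the social estimate is 0.8 x 4 x 0.5 = 1.6 and the textual one is 0.2 x 7 = 1.4, so the social branch is followed.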

Note that we deal with the tags of the query "in bulk", and advance simultaneously 
on their inverted lists when the textual branch is followed. 

Remark. We have adopted so far a "disjunctive" interpretation for queries, in which
items can score on each tag-dimension individually. However, our approach can be
adapted in a straightforward manner to a "conjunctive" interpretation: the pessimistic
score should be maintained at 0 until the item's scores - at least partial ones - are
known for all tags.

Figure 2: Examples on the evolution of proximity values.



5. Efficiency by Approximation

The algorithm described in the previous section is sound and complete, and requires no
prior (aggregated) knowledge on the proximity values with respect to a certain seeker
(e.g., statistics); this was also the assumption in the CONTEXTMERGE algorithm of [18].
Moreover, it is instance optimal in the exclusively social setting (our main focus in this
paper) with respect to the number of visited users. While we improve the running time
in both this setting and the general one (more on experimental results in Section 7),
in practice the search may still visit a significant part of the user network
and their item lists before being able to conclude that the top-$k$ answer can no longer
change.

But if some statistics about proximity are known at query time (i.e., on how the values
in a proximity vector vary from the most relevant user to the least relevant one), this
may enable us to use more refined termination conditions, and thus to minimize the gap
between the step at which the final top-$k$ has been established and the actual termination
of the algorithm. Indeed, the experiments we performed on Del.icio.us data showed that,
on average, the last top-$k$ change occurs much sooner, hence there is a clear opportunity
to stop the browsing of the network earlier.

We take a first step in this direction, discussing two possible approaches for using score
estimations based on proximity statistics, which trade accuracy for efficiency (in terms of
visited users). More specifically, in Algorithm 3 the MaxScore, MaxScoreUnseen
and MinScore bounds have all used the safest possible values for the proximities of yet
unseen users: either the top (maximum) value of the max-priority queue ($top(H)$) for
the first two bounds, or its minimal possible value (zero) for the third one. In practice,
however, these extreme configurations are rarely met. For illustration, we give in
Figure 2 the proximity vectors for some randomly sampled users. Observe that these fall
rapidly, and this may be the case in many real-world similarity or proximity networks.

Hence one possible direction for reducing the number of visited users is to pre-compute
and materialize a high-level description (more or less complex, more or less accurate)
of users' proximity vectors (of their distribution of values). This would allow us to use
a tighter estimation for the remaining (unseen) users, instead of uniformly associating
them with the extreme score ($top(H)$ or 0). In doing so, we may obviously introduce
approximations in the final result, and our approximate techniques provide a trade-off between
accuracy drop on one hand and negligible memory consumption and reduced running
time on the other hand.

5.1. Estimating bounds using mean and variance 

We first consider as a proximity vector description one that is very concise yet generally
applicable and effective, keeping for a given seeker two parameters: the mean value
of the proximities in the vector and the variance of these values. We adopt here the
simplifying assumption that the values in the seeker's vector are independent, essentially
interpreting the proximity vector as a random one.

At any step in the run of the algorithm, using the mean and variance, for the remaining
(yet unvisited) $unseen\_users(i,t)$ for a given item $i$ and tag $t \in Q$, we can derive (a)
lower bounds for the average of their proximity values, for MinScore estimations, or (b)
upper bounds for the average of their proximity values, for MaxScore estimations. The
guarantees of these bounds can be controlled (in a probabilistic sense) via a precision
parameter $\delta \in (0, 1]$, by which lower values lead to higher precision and 1 leads to a
setting with no guarantees.

More precisely, let $p$ be the current position in the proximity vector and let $\sigma^+_{\geq p}(s)$ be
the vector containing the remaining (unseen) values of $\sigma^+(s)$. Knowing the overall mean
and variance of the entire proximity vector $\sigma^+(s)$, and having the proximity values seen
so far (denoted $\sigma^+_{<p}(s)$), we can easily compute the average and variance of the remaining
proximity values (those in $\sigma^+_{\geq p}(s)$).

Then, the mean and variance of the average of $unseen\_users(i,t)$ randomly chosen
proximity values from the remaining ones can be obtained as follows (footnote 6):

$$Exp[\sigma^+_{\geq p}, unseen\_users(i,t)] = E[\sigma^+_{\geq p}(s)],$$

$$Var[\sigma^+_{\geq p}, unseen\_users(i,t)] = \frac{Var[\sigma^+_{\geq p}(s)]}{unseen\_users(i,t)}.$$

When the input query contains more than one tag, its size $|Q|$ needs to be taken into account
in the estimations. In order to avoid computational overhead, we uniformly choose
a non-optimal per-tag probabilistic parameter $\delta'$ that ignores per-tag score distributions,
as follows:

$$\delta' = 1 - (1-\delta)^{1/|Q|}. \qquad (5.1)$$
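
One way to read Eq. (5.1), offered here as an interpretation rather than as the authors' stated derivation, is that the $|Q|$ per-tag estimates are treated as independent and are required to hold jointly with probability at least $1 - \delta$. The short derivation and the numeric instance below are illustrative only.

```latex
% Joint guarantee over |Q| independent per-tag bounds, each holding
% with probability 1 - \delta':
(1 - \delta')^{|Q|} = 1 - \delta
\quad\Longrightarrow\quad
\delta' = 1 - (1 - \delta)^{1/|Q|}.

% Numeric instance for \delta = 0.1 and a two-tag query (|Q| = 2):
\delta' = 1 - \sqrt{0.9} \approx 0.0513 .
```

The per-tag parameter is therefore smaller (stricter) than $\delta$ as soon as the query has more than one tag, and coincides with it for single-tag queries.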

$EstMax(p, \delta)$ represents, for each query tag, the upper bound of the expected value of
the average of $unseen\_users(i,t)$ values drawn from $\sigma^+_{\geq p}(s)$, which holds with probability
at least $1 - \delta'$. Similarly, $EstMin(p, \delta)$ represents the lower bound of the expected
value of the average of $unseen\_users(i,t)$ values drawn from $\sigma^+_{\geq p}(s)$, which holds with
probability at least $1 - \delta'$. For estimating MinScore when $i \notin CIL(t)$, the fact that
we have no information about the difference between $tf(i,t)$ and $partial\_tf(t,i)$ (the
users who tagged item $i$ with $t$ so far) means that we cannot assume that other users
may have tagged $i$, so we keep this estimation as in the initial (exact) algorithm.

6 This is possible under independence assumptions that may not entirely hold, but turn out to be
reasonable in practice (see Section 7).

Table 1: Optimistic and pessimistic estimations of $fr(i \mid s,t)$ at step $p$ (general case).

| score | $i \in CIL(t)$ | estimation |
|---|---|---|
| $MinScore(i,t)$ | yes | $\alpha \times tf(i,t) + (1-\alpha) \times (sf(i \mid s,t) + EstMin(p,\delta) \times unseen\_users(i,t))$ |
| $MinScore(i,t)$ | no | $\alpha \times partial\_tf(t,i) + (1-\alpha) \times sf(i \mid s,t)$ |
| $MaxScore(i,t)$ | yes | $\alpha \times tf(i,t) + (1-\alpha) \times (sf(i \mid s,t) + EstMax(p,\delta) \times unseen\_users(i,t))$ |
| $MaxScore(i,t)$ | no | $\alpha \times top\_tf(t) + (1-\alpha) \times (sf(i \mid s,t) + EstMax(p,\delta) \times unseen\_users(i,t))$ |
| $MaxScoreUnseen(t)$ | - | $\alpha \times top\_tf(t) + (1-\alpha) \times EstMax(p,\delta) \times top\_tf(t)$ |

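By using Chebyshev's inequality, these bounds can be computed as follows:

$$EstMax(p, \delta) = E[\sigma^+_{\geq p}(s)] + \sqrt{\frac{Var[\sigma^+_{\geq p}(s)]}{unseen\_users(i,t) \times \delta'}},$$

$$EstMin(p, \delta) = E[\sigma^+_{\geq p}(s)] - \sqrt{\frac{Var[\sigma^+_{\geq p}(s)]}{unseen\_users(i,t) \times \delta'}}.$$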
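
The mean-and-variance estimation can be sketched in Python as below. This is an illustration under assumptions, not the authors' code: function names are hypothetical, population variance is used throughout, and the bounds are returned unclamped (a caller may want to cap them at $top(H)$ and 0, which the text does not specify).

```python
import math


def remaining_stats(n, mean_all, var_all, seen):
    """Mean and population variance of the proximity values not yet seen.

    n, mean_all, var_all: size, mean and variance of the whole vector
    seen                : proximity values encountered so far
    """
    m = n - len(seen)
    if m <= 0:
        raise ValueError("no unseen values left")
    total = n * mean_all - sum(seen)
    total_sq = n * (var_all + mean_all ** 2) - sum(x * x for x in seen)
    mean_rem = total / m
    var_rem = max(total_sq / m - mean_rem ** 2, 0.0)   # guard rounding noise
    return mean_rem, var_rem


def chebyshev_bounds(mean_rem, var_rem, unseen_users, delta, query_size):
    """Return (EstMax, EstMin) for the average of `unseen_users` values."""
    delta_tag = 1 - (1 - delta) ** (1 / query_size)    # per-tag parameter
    spread = math.sqrt(var_rem / (unseen_users * delta_tag))
    return mean_rem + spread, mean_rem - spread


values = [1.0, 0.8, 0.4, 0.2, 0.1]                      # hypothetical vector
mean_all = sum(values) / len(values)
var_all = sum(x * x for x in values) / len(values) - mean_all ** 2
mean_rem, var_rem = remaining_stats(len(values), mean_all, var_all, values[:2])
print(chebyshev_bounds(mean_rem, var_rem, unseen_users=3, delta=0.5, query_size=2))
```

Smaller values of `delta` widen the interval, which matches the statement that lower values of the precision parameter give stronger guarantees at the price of looser bounds.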



We give the score estimations, changed by generalizing the proximity estimations, in
Table 1. We present in the experimental results the effect of this approximate approach
on running time, showing significant overall improvement. In our experiments, even for
$\delta = 0.9$, the returned top-$k$ answers had reasonable precision levels (around 90%).

We discuss in the next section another approach for tighter score estimates, using more
detailed descriptions of proximity vectors. We then discuss, in Section 5.3, how these
concise descriptions of proximity vectors could be kept up-to-date in dynamic
environments.



5.2. Estimating bounds using histograms 

The advantage of the approach described in the previous section is twofold: low memory
requirements and estimation bounds that are applicable for any value distribution.
However, it may offer estimation bounds that are too loose in practice, and hence not
reach the full potential for efficiency of approximate score bounds. To address this issue,
we can imagine - as a compromise between keeping only these two statistics and
keeping the entire pre-computed proximity vector - an approach in which we describe
the distribution at a finer granularity, based on histograms.

More precisely, for a seeker $s$, we denote this histogram as $h(\sigma^+(s))$. It consists of $b$
buckets, each bucket $b_i$, for $i \in \{1, \dots, b\}$, containing $n_i$ items in the interval $(low_i, high_i]$
(the 0 values are assigned to bucket $b$). Then, the probability that there exists a proximity
value $x$ greater than $low_i$, knowing the histogram $h(\sigma^+(s))$, is

$$\Pr[x > low_i \mid h(\sigma^+(s))] = \sum_{j=1}^{i} n_j / n.$$

At any step $p$ in the run of the algorithm, we maintain a partial histogram denoted
as $h(\sigma^+_{\geq p}(s))$, obtained by removing from $h(\sigma^+(s))$ the $p$ already encountered proximity
values.

Similar to the previous approach, we can drill down the overall $\delta$ parameter to a $\delta'$
one for each query tag. Then, $EstMax(p, \delta)$ can be given by the minimal value in the
partial histogram such that the resulting estimation of $MaxScore(i,t)$ holds with
probability at least $1 - \delta'$. Conversely, $EstMin(p, \delta)$ is given by the maximal value in the
partial histogram such that the resulting estimation of $MinScore(i,t)$ holds with
probability at least $1 - \delta'$.

In a manner similar to Eq. (5.1), we need to take into account the fact that a number
of $unseen\_users(i,t)$ such estimated values lead to an overall approximate estimation,
for both EstMin and EstMax. Therefore, each of these values is uniformly estimated
using a stronger probabilistic parameter $\delta''(i,t)$, depending on $unseen\_users(i,t)$, as
follows:

$$\delta''(i,t) = 1 - (1-\delta')^{1/unseen\_users(i,t)}.$$

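Formally, having $h(\sigma^+_{\geq p}(s))$ and $\delta''(i,t)$, we estimate $EstMax(p, \delta)$ and $EstMin(p, \delta)$
as follows:

$$EstMax(p, \delta) = \min\{low_i \mid \Pr[x > low_i \mid h(\sigma^+_{\geq p}(s))] < \delta''(i,t)\},$$

$$EstMin(p, \delta) = \max\{low_i \mid \Pr[x > low_i \mid h(\sigma^+_{\geq p}(s))] > 1 - \delta''(i,t)\}.$$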
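
The histogram-based bounds can be sketched in Python as follows. The function names and the bucket representation are assumptions made for illustration, and returning `None` when no bucket satisfies a condition is our choice (the text does not say what to do in that case; falling back to the exact bounds would be one option).

```python
def per_value_delta(delta_tag, unseen_users):
    """Stronger per-value parameter delta'' derived from the per-tag delta'."""
    return 1 - (1 - delta_tag) ** (1 / unseen_users)


def histogram_bounds(buckets, delta2):
    """Return (EstMax, EstMin) from a partial histogram.

    buckets: list of (low, count) pairs for the remaining values, ordered
             from the bucket with the highest values to the lowest one
    delta2 : the per-value probabilistic parameter delta''
    """
    n = sum(count for _, count in buckets)
    tail, probs = 0, []
    for low, count in buckets:
        tail += count
        probs.append((low, tail / n))          # Pr[x > low_i | histogram]
    max_cands = [low for low, p in probs if p < delta2]
    min_cands = [low for low, p in probs if p > 1 - delta2]
    est_max = min(max_cands) if max_cands else None
    est_min = max(min_cands) if min_cands else None
    return est_max, est_min


# Hypothetical partial histogram with 10 remaining values in 4 buckets.
buckets = [(0.75, 1), (0.5, 1), (0.25, 2), (0.0, 6)]
print(histogram_bounds(buckets, 0.3))
```

Here only 20% of the remaining values exceed 0.5, so 0.5 is the smallest threshold whose exceedance probability stays below 0.3 and it is returned as the optimistic estimate.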

The space needed for keeping such histograms is linear in the number of users and
buckets. For instance, by setting the latter using the square-root choice, the memory
needed is $O(n^{3/2})$. Also, as a consequence of the on-the-fly computation of proximity
values, we can easily update the histogram of the seeker by merging the partial, "fresh"
histogram obtained in the current run (until termination) with the remaining values
from the existing (pre-computed) histogram.

5.3. Maintaining the description of the proximity vector 

Since social tagging applications are highly dynamic in nature, we need to take into
account the fact that the statistics we keep are likely to change quite often. While we
can hope that mean, variance and even histogram descriptions are less subject to change
than individual proximity values, we should still strive to keep these statistics as
fresh as possible. Recomputing them from scratch, at certain intervals, is an obvious
option to consider, though one that may still be too expensive, knowing that we want to
avoid keeping the $n \times n$ materialized proximity matrix, as well as naive re-computation
of mean and variance pairs.

A more suitable alternative would be to rely on approximate techniques for maintaining
fully dynamic all-pairs shortest path (APSP) information in the network. Since
our proximity metric relies on path multiplication, we can reformulate the computation
of proximity values into a problem of computing shortest paths in a network with (a)
the same set of vertices and edges, and (b) edge weights valued $w(u,v) = -\log \sigma(u,v)$,
where $\sigma(u,v)$ is the user proximity from the original network.

A $(2+\epsilon)$-approximate algorithm was given in [5], which handles fully-dynamic updates
in a graph in $\tilde{O}(e)$ (almost linear) time. It exhibits a query time of $O(\log \log \log n)$ (the
query returns an estimation of the shortest distance between two nodes), without the
need of keeping a distance matrix. We could directly rely on this algorithm in the
transformed $-\log \sigma(u,v)$ graph. Mapping back the distances thus queried to our setting
would give us an estimation $\sigma^+_{est}(u,v)$ that verifies the inequality:

$$\sigma^+(u,v) \geq \sigma^+_{est}(u,v) \geq \sigma^+(u,v)^{2+\epsilon}.$$

For a given seeker $s$, we could thus compute an approximation of its proximity vector
in $O(n \log \log \log n)$ time, and then compute the approximate statistics efficiently.
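
The correspondence between path multiplication and shortest paths can be checked numerically. The Python sketch below is illustrative only; the helper names are hypothetical, and `proximity_interval` simply inverts the approximation guarantee, assuming the returned distance lies between the true distance and $(2+\epsilon)$ times it.

```python
import math


def path_proximity(weights):
    """Proximity of a path: product of its edge proximities."""
    return math.prod(weights)


def path_distance(weights):
    """Length of the same path in the transformed graph, w = -log(sigma)."""
    return sum(-math.log(w) for w in weights)


def proximity_interval(sigma_est, eps):
    """Range of the true proximity compatible with an estimate sigma_est.

    If d <= d_est <= (2 + eps) * d, then
    sigma_est <= sigma <= sigma_est ** (1 / (2 + eps)).
    """
    return sigma_est, sigma_est ** (1 / (2 + eps))


weights = [0.9, 0.8]
print(path_proximity(weights), math.exp(-path_distance(weights)))
print(proximity_interval(0.25, eps=0.0))
```

Since the logarithm is monotone, the path with the largest product of proximities is exactly the shortest path under the transformed weights.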

6. Scaling and performance 

We argue in this section that, in a real-world setting, our algorithm TOPKS outperforms
the one from existing literature both in terms of memory requirements and execution
time. We discuss its practical impact in experiments in Section 7.

Let us consider, as an illustrating example, one of the most popular bookmarking
applications, Del.icio.us, which currently has probably around $10^7$ users. Unsurprisingly,
this social network is quite sparse, with an average degree of about 100. If a similar
graph configuration were maintained when weights (the $\sigma$ function) are associated
with the edges of the network (e.g., based on tagging proximity or some other measure), the
size of an index that would precompute the extended proximity value for each pair of
connected users in the network (the $\sigma^+$ function) would be roughly 700 terabytes (i.e.,
$(10^7)^2 \times 7$ bytes, considering that 3 bytes are necessary for a user ID and 4 bytes are
necessary for the float value of proximity). On the other hand, the weighted graph would
require memory space of roughly 7 gigabytes (as $10^7 \times 100 \times 7$ bytes), and could easily
fit in the RAM space of an average commodity workstation (see footnote 7). Moreover, existing techniques
for network compression [9] might allow us to reduce the space required to store the
network by a factor of 10-15 while still supporting efficient updates and random access
on compressed data.
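
The two storage estimates above can be reproduced with a few lines of Python. The function names are ours, the entry size (3 bytes for a user ID plus 4 bytes for a float) is the one stated in the text, and decimal units are assumed (1 terabyte = $10^{12}$ bytes).

```python
BYTES_PER_ENTRY = 3 + 4   # user ID + float proximity, as assumed in the text


def full_index_bytes(n_users):
    """Size of an index materializing a proximity value for every user pair."""
    return n_users ** 2 * BYTES_PER_ENTRY


def graph_bytes(n_users, avg_degree):
    """Size of the weighted graph stored as adjacency lists."""
    return n_users * avg_degree * BYTES_PER_ENTRY


n = 10 ** 7
print(full_index_bytes(n) / 10 ** 12, "TB")   # materialized proximity index
print(graph_bytes(n, 100) / 10 ** 9, "GB")    # network itself
```

The five orders of magnitude between the two figures come from replacing one factor of $n$ by the average degree.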

The difference in memory requirements for the two alternatives becomes much more
drastic when assuming a user base of the order of Facebook's social network, which currently
consists of roughly $7 \times 10^8$ users (and is still growing at a fast pace). Precomputed
lists for extended proximity go up to about 400 petabytes of memory space, while the
network itself requires only about 400 gigabytes. The space needed to store the network
can further decrease to fit the RAM capacity that moderate commodity servers can provide
today, if considering the compression techniques mentioned previously.

7 We stress that, for the sake of generality, this is not assumed nor exploited in our algorithms, and is
not accounted for in the experimental results for TOPKS (in both abstract cost and running time).

Table 2: Computational costs for processing a query $Q$, when $\alpha = 0$.

| Algorithm | Disk access (RA) | Disk access (SA) | RAM access |
|---|---|---|---|
| ContextMerge | 1 | $n$ | $(\lvert Q \rvert - 1) \times n$ |
| TOPKS$_{\alpha=0}$ | - | - | $O(n \lg n + e) + (\lvert Q \rvert - 1) \times n + n + e$ |

We next discuss general performance aspects, which in practice may have as much impact
as the memory and updatability advantages that our algorithm presents.

Let $n$ denote the number of users and let $e$ denote the number of edges in the network.
We assume without loss of generality that the query consists of a single tag (for multiple-tag
queries, all dimensions can share the results of a single $\sigma^+$ computation).

For our algorithm, let us assume that the social network resides in main memory,
e.g., by means of adjacency lists: for each vertex, we have a list of its neighbors and
their associated weights (we can safely assume the list comes presorted descending by
weight). For one top-$k$ query execution, we will need at most $n + e$ operations to visit
the entire network (we are guaranteed to take each vertex only once). For the proximity
computation we can use a Fibonacci-heap based max-priority queue, since our graph is
likely to be very sparse [16]. Each insertion into the heap takes $O(1)$ amortized time,
each extraction takes $O(\lg n)$ and each increase of a key (a relaxation step) takes $O(1)$
amortized time, for an overall queue complexity of $O(n \lg n + e)$.

CONTEXTMERGE requires no computations for proximity at query time. However, 
it uses disk accesses to read the precomputed proximity values: one random access to 
locate the seeker's list and n sequential disk accesses to read this list. (It suffices to do 
this just for one query term, and then keep and access a shared copy of this list in main
memory.)

If we value the latency of a memory access as 1 and that of a sequential disk
access as $t$ (usually about five orders of magnitude slower than RAM access), with minor
simplifications, our algorithm has the potential to perform better than CONTEXTMERGE
when the following holds: $t > \lg n + e/n$. So the network sparseness should verify the
following inequality:

$$e < n \times (t - \lg n),$$

which is a very plausible assumption in real applications.
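
As a quick sanity check of this inequality, the following Python sketch evaluates it for a network of the size discussed above. The function name is hypothetical and the sample values (average degree 100, disk access $10^5$ times slower than RAM) are illustrative assumptions taken from the orders of magnitude mentioned in the text.

```python
import math


def in_memory_traversal_wins(n, e, t):
    """True when e < n * (t - lg n), i.e. the sparseness condition holds.

    n: number of users, e: number of edges,
    t: latency of a sequential disk access relative to a memory access
    """
    return e < n * (t - math.log2(n))


n = 10 ** 7
e = n * 100          # average degree of about 100
t = 10 ** 5          # about five orders of magnitude slower than RAM
print(in_memory_traversal_wins(n, e, t))
```

With these values $\lg n$ is about 23, so the condition would only fail for an average degree close to $10^5$, far denser than the networks considered here.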

A summary of this comparison on execution time is given in Table 2. Note that in this
analysis we omitted initialization costs: the overhead necessary for CONTEXTMERGE
to compute $\sigma^+$ values for all user pairs and the overhead to load the social network
in main memory, for our algorithm.

7. Experimental Results



Dataset and testing methodology. We have performed our experiments on a publicly
available Del.icio.us dataset [20], containing 80000 users tagging 595811 items with
198080 tags. As this dataset does not give information regarding links between users,
we have generated three similarity networks:

• Item similarity network. This network was constructed by computing the Dice 
coefficient of the common items bookmarked by any two users, resulting in a 
network of 49038 users and 3329540 links. 

• Tag similarity network. This network was generated by computing the Dice coefficient
of the common tags used by any two users. Since this computation results
in a network that is too dense, we have filtered out the users who used fewer than
10 distinct tags in their tagging activity. The final network thus contains 40319
users and 8335544 links.

• Item-tag similarity network. This network was constructed by computing the Dice 
coefficient of the common items and tags bookmarked by any two users, resulting 
in a network containing 40353 users and 1849898 links. 
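
For reference, the Dice coefficient used to build these networks can be computed as in the Python sketch below. The function name and the sample bookmark sets are hypothetical, and returning 0 for two empty sets is a convention chosen here; the text does not describe how the authors handled that case or which threshold, if any, was applied to create a link.

```python
def dice(a, b):
    """Dice coefficient of two collections: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0                     # convention for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical bookmark sets of two users.
alice_items = {"d1", "d2", "d3"}
bob_items = {"d2", "d3", "d4"}
print(dice(alice_items, bob_items))
```

The same function applies to tag sets, or to sets of (item, tag) pairs for the item-tag network.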

We computed the top-10 and top-20 answers, generating 20 two-tag and
three-tag semantically coherent queries, from tags that have a medium frequency (i.e.,
between 3000 and 5000 in our dataset). For each similarity network, 10 users
were also randomly chosen in the role of the seeker.

Testing was performed using two ranking functions (the $h$-function from our model).
The first one is the standard tf-idf ranking function:

$$score(i \mid u,t) = fr(i \mid u,t) \times idf(t).$$

The second one is the BM15 ranking function used in [18]. In both, the inverse frequency
$idf(t)$ is defined in the standard manner as

$$idf(t) = \log \frac{|\mathcal{I}| - |\{i \mid Tagged(v,i,t)\}| + 0.5}{|\{i \mid Tagged(v,i,t)\}| + 0.5},$$

and the aggregation function $g$ is summation.

While these are two of the most commonly used ranking functions in IR literature, they
have different properties when used in approximate approaches such as the ones we describe.
More precisely, since tf-idf is a linear function, both the maximal and minimal estimates
over fr scores lead to valid estimates for the overall scores. This is not necessarily the
case for BM15: since it is a concave function, only the maximal overall score can be
estimated. This was taken into account in the experiments.
We used a Java implementation of our algorithms, on a machine with a 2.8GHz Intel
Core i7 CPU, 8GB of RAM, running Ubuntu Linux 10.04 and PostgreSQL 9.0.

As our focus is on optimizing the social branch of the top-k retrieval, we report here
our results for $\alpha \in \{0, 0.1, 0.2, 0.3\}$. As in [18], multiplication over the
paths was chosen as the proximity aggregation function, as the best suited candidate for
predicting implicit similarities.
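To make the choice of multiplication concrete, here is a minimal Python sketch that
visits users in decreasing order of proximity to the seeker. It takes the proximity of
a path to be the product of its edge weights and the proximity of a user to be the best
value over all paths. The maximum over paths, the adjacency-dictionary layout and the
name `users_by_proximity` are assumptions made for illustration, and edge weights are
assumed to lie in (0, 1].

```python
import heapq

def users_by_proximity(graph, seeker):
    """Yield (user, proximity) pairs in non-increasing order of proximity.

    graph: dict mapping user -> dict of neighbour -> weight in (0, 1].
    """
    best = {seeker: 1.0}
    heap = [(-1.0, seeker)]  # max-heap simulated with negated proximities
    done = set()
    while heap:
        neg_prox, user = heapq.heappop(heap)
        if user in done:
            continue  # stale entry: a better path was already processed
        done.add(user)
        if user != seeker:
            yield user, -neg_prox
        for neighbour, weight in graph.get(user, {}).items():
            candidate = -neg_prox * weight
            if neighbour not in done and candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
```

Because no weight exceeds 1, extending a path can never increase its product, which is
what makes this greedy, Dijkstra-style visiting order valid without any pre-computation.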

Remark. The relevance of personalized query results is a topic that has been
extensively treated ([19, 11, 15]). It is not our focus here, and we interpret the
relevance of results as a consequence of the scoring functions g and h. Moreover, the
query itself could be viewed as the result of a transformation using techniques such as
query expansion. The relevance of social search results was also extensively evaluated
in [18], over Del.icio.us data, in a setting (including ranking model) similar to ours.
We report, however, on two ground-truth experiments for evaluating the relevance of
top-k results for $\alpha = 0$, at the end of this section.

Efficiency results. For the testing environment described previously, we report on 
efficiency for both exact and approximate algorithms, and on precision for the latter. 

For efficiency, we report on two measures: the abstract cost of the algorithms and
their wall clock running times. Abstract cost, which is the standard measure for
early-termination algorithms that depend on database accesses, is computed as defined
in Section 4.1 by setting $c_{UL}$, the cost of accessing a user list, to 100 (a very
conservative upper bound), and $c_S$, the cost of sequentially accessing an item in an
inverted list, to 1. More formally,

$$\mathrm{cost}(A, \mathcal{D}) = 100 \times \mathrm{users}(A, \mathcal{D}) + \mathrm{seqitems}(A, \mathcal{D}).$$
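The abstract cost is a plain weighted sum of access counts, which the following Python
sketch spells out. The constant and function names are hypothetical; the example
values are the users and seqitems counts reported for TOPKS on the item network in
Table 3.

```python
USER_LIST_COST = 100  # cost of accessing one user list, as set in the text
SEQ_ITEM_COST = 1     # cost of one sequential access in an inverted list

def abstract_cost(users: int, seqitems: int) -> int:
    """cost(A, D) = 100 * users(A, D) + seqitems(A, D)."""
    return USER_LIST_COST * users + SEQ_ITEM_COST * seqitems

# TOPKS on the item network (alpha = 0): 15588 visited users, 65 sequential accesses.
print(abstract_cost(15588, 65))  # 1558865
```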

We ignore differences in favor of TOPKS that are hard to account for, namely we do not 
distinguish between the user accesses by CONTEXTMERGE (which in a real setting would 
be to external memory) and the ones by TOPKS (which would be to main memory). 

Figures 3, 4 and 5 present the comparison of abstract costs and running times for the
BM15 and tf-idf ranking functions, for each of the three similarity networks. In each
subfigure, the first pair of columns gives the abstract cost of the CONTEXTMERGE
algorithm of [18], the second pair of columns the one of TOPKS, the third pair of
columns the cost of TOPKS/MVar (approximate approach based on mean and variance of
proximities, described in Section 5.1) and the fourth pair of columns the cost of
TOPKS/Hist (approximate approach based on histograms, described in Section 5.2). For
each algorithm, the average running times were recorded, and are represented by the
black line in the plots (one dot indicates the average running time between the top-10
and the top-20). One can notice there that abstract cost closely captures the actual
performance of the algorithms. However, running time optimization was not the focus of
the present work, and many alternatives remain to be explored in that direction (e.g.,
tuning the database).



Note that we cannot compare with pQ's approach, as it only extends classic top-k retrieval by inter- 
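Table 3: Comparison between CONTEXTMERGE and TOPKS$_{\alpha=0}$.

| Network  | CONTEXTMERGE users | CONTEXTMERGE seqitems | TOPKS$_{\alpha=0}$ users | TOPKS$_{\alpha=0}$ seqitems |
|----------|--------------------|-----------------------|--------------------------|-----------------------------|
| item     | 21878              |                       | 15588                    | 65                          |
| item-tag | 13028              |                       | 6898                     | 54                          |
| tag      | 18718              |                       | 15581                    | 68                          |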

First, we can see that in general TOPKS drastically improves efficiency when compared
to CONTEXTMERGE, in terms of both running time and abstract cost. For example, in the
item-tag similarity network, when $\alpha = 0$, the running time and abstract cost are
around 50% of those of CONTEXTMERGE.

Moreover, our approximate approaches lead to further improvements, which supports
the intuition that even limited statistics (such as mean and variance) can render the
termination conditions tighter.

The abstract costs of TOPKS/MVar and TOPKS/Hist in the figure were obtained for the
probabilistic threshold $\delta = 0.9$. Even though this represents a quite weak
guarantee, we found that it still yields a good precision/efficiency trade-off. For a
better understanding of this trade-off, we show in Figure 6 the impact of $\delta$ on
precision. When $\alpha > 0$, visiting the per-term inverted lists in parallel to the
proximity vector helps in deriving tighter score bounds for unseen items, leading to a
faster termination of the approximate approaches. These tighter score bounds also help
in achieving better precision levels when $\alpha > 0$, as Figure 7 shows.

Furthermore, our branch choice heuristic in TOPKS (in both the exact and approximate
variants) brings significant improvements overall (for instance, consider the
difference between the cost savings for $\alpha = 0$ and $\alpha = 0.1$, in the tag
similarity network). Finding even more effective heuristics for this aspect of the
algorithm remains an interesting direction for future research.

We discuss next how the instance optimality of TOPKS$_{\alpha=0}$ reflects in the
performance results. Table 3 reports the number of visited users by CONTEXTMERGE and
TOPKS$_{\alpha=0}$ (columns users), for the three similarity networks. One can see that
TOPKS$_{\alpha=0}$ achieves good savings (in terms of visited users), while relying
only on very few sequential accesses in the inverted lists (column seqitems).

Finally, we consider the impact of the probabilistic parameter $\delta$ on precision
and speedup in the approximate algorithms. We define precision as the ratio between the
number of common items returned by the respective approximate approach and TOPKS, and
the size of the exact result (by TOPKS), i.e.,

$$\mathrm{precision} = \frac{|T_{\mathrm{TOPKS/app}} \cap T_{\mathrm{TOPKS}}|}{|T_{\mathrm{TOPKS}}|},$$



preting user proximity as a binary function (0-1 proximity), by which only users who are directly 
connected to the seeker can influence the top-k result. 
where $T_{\mathrm{TOPKS/app}}$ is the set of items returned as top-k by the approximate
algorithms (either TOPKS/MVar or TOPKS/Hist), and $T_{\mathrm{TOPKS}}$ is the set of
items returned by the exact algorithm.

The relative speedup is defined as

$$\mathrm{speedup} = \frac{\mathrm{cost}(\mathrm{TOPKS}, \mathcal{D})}{\mathrm{cost}(\mathrm{TOPKS/app}, \mathcal{D})}.$$
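Both measures are simple to compute once the two result sets and the two abstract costs
are available. The Python sketch below is illustrative only; the function names and the
representation of results as lists of item identifiers are assumptions.

```python
def precision(approx_topk, exact_topk):
    """Fraction of the exact top-k that the approximate run also returned."""
    exact = set(exact_topk)
    return len(set(approx_topk) & exact) / len(exact)

def speedup(cost_exact, cost_approx):
    """Abstract cost of exact TOPKS divided by that of the approximate variant."""
    return cost_exact / cost_approx
```

For instance, an approximate top-4 that shares three items with the exact top-4 has a
precision of 0.75.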

We present in Figure 6 the results for both approximate approaches, TOPKS/MVar and
TOPKS/Hist. For TOPKS/MVar, one can notice that $\delta$ has a limited influence on
precision (with a minimum of 0.997 for $\delta = 1$), while ensuring reasonable
speedup. The speedup potential is greater when using TOPKS/Hist and histograms, while
reasonable precision levels are obtained (for instance, precision of around 0.805 when
$\delta = 0.9$, for a speedup of around 2.5). For values of $\delta > 0.9$, we notice
however a rapid drop in precision. The fact that MVar achieves better precision than
Hist may seem counter-intuitive, since histograms give a more detailed description of
proximity vectors. This difference in precision is due to the looser bounds of MVar:
since they directly influence the termination condition of the algorithm, they result
in a longer run and hence in better chances of returning a more refined top-k result.

We also considered the influence of the $\alpha$ parameter on precision, while setting
the probabilistic parameter to $\delta = 0.9$ (see Figure 7). We have measured both
precision@10 (i.e., when requesting the top-10) and precision@20 for both TOPKS/MVar
and TOPKS/Hist. We observed that the precision levels for TOPKS/MVar are quite stable
for all values of $\alpha$. For TOPKS/Hist, the lowest values of precision are
witnessed when $\alpha = 0$, but they stabilize to high values (above 0.97) for
$\alpha > 0$.

Evaluating relevance. We report now on two "ground-truth" experiments we have
performed to test the bookmark prediction power of the exclusively social queries.

For the first experiment, we have selected (user, tag) pairs from users that have book- 
marked between 5 and 10 items using tags that were used globally at least 1000 times. 
The objective of this experiment was to estimate the power of personalized results to 
predict items that are tagged using relatively popular tags. 

Then, 1000 pairs were randomly selected. For each pair and for
$k \in \{1, 2, 5, 7, 10\}$, we computed the following top-k results, using as query the
tags corresponding to the distinct user-ids: the network-unaware top-k and, setting the
user-id as the seeker, the personalized top-k (for $\alpha = 0$) for each of the
following aggregation functions: $f_{mul}$, $f_{min}$, and $f_{pow}$ with
$\lambda \in \{1.1, 2\}$. For each personalized query, the items belonging to the
seeker were ignored (so as not to influence positively the precision of the results).

An item was considered as "predicted" if it appeared in the resulting top-k and was
also tagged by the seeker with the query tags. We traced the proportion of pairs
for which at least one such item has been predicted.
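The measure traced in this experiment can be sketched in Python as follows. The two
callables `topk_for` and `tagged_by_seeker` are hypothetical stand-ins for the ranking
procedure and the held-out bookmarks; the sketch assumes that `topk_for` already ignores
the seeker's own tagging actions, as described above.

```python
def hit_rate(pairs, topk_for, tagged_by_seeker):
    """Proportion of (user, tag) pairs with at least one predicted item.

    topk_for(user, tag)         -> iterable of items returned as top-k
    tagged_by_seeker(user, tag) -> set of items the user tagged with the tag
    """
    if not pairs:
        return 0.0
    hits = 0
    for user, tag in pairs:
        held_out = tagged_by_seeker(user, tag)
        # A pair counts once, however many of its items are predicted.
        if any(item in held_out for item in topk_for(user, tag)):
            hits += 1
    return hits / len(pairs)
```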

The results are presented in Figure 8. One can note that, for the item and item-tag
similarity networks, personalization is considerably better at predicting bookmarked
items than the "global" top-k, for all functions, except $f_{min}$ and, to a lesser
extent, $f_{pow}$
Figure 3: Abstract cost and running time comparison over the tag-similarity network
and the $f_{mul}$ proximity function (red: top-10, yellow: top-20).

Figure 4: Abstract cost and running time comparison over the item-similarity network
and the $f_{mul}$ proximity function (red: top-10, yellow: top-20).

Figure 5: Abstract cost and running time comparison over the item-tag similarity
network and the $f_{mul}$ proximity function (red: top-10, yellow: top-20).

Figure 6: Precision rates versus speedup relative to TOPKS, when $\alpha = 0$.

Figure 7: Precision rates vs. $\alpha$, when $\delta = 0.9$.

Figure 8: Predicting bookmarks tagged with semi-popular tags.

Figure 9: Predicting unpopular bookmarks.



($\lambda = 1.1$). Moreover, the tag similarity network seems to be a weaker predictor
than the other two networks, no matter the personalization function used. This might
indicate that, in the case of tag similarity, one needs to go beyond simple set
similarities and include more complex relationships between tags, like synonymy and
polysemy.

For the second experiment, we have selected (user, item, tag) triples resulting from 
items that have been tagged only by few people in the network (between 5 and 10). The 
objective of this experiment was to estimate the power of personalizing results to predict 
items that are unpopular, i.e., the "long tail". The tests and the measures tracked are 
identical to the setup of the first experiment. 

The results are presented in Figure 9. They are similar to a good extent to those of
the first experiment, with two main differences: (i) personalization fails completely in
the tag similarity network, (ii) in the item and item-tag similarity networks
personalization achieves considerably higher prediction performance than in the case of
predicting items tagged with popular tags. This is because, in the case of long-tail
items, the functions that are skewed towards the closest users, i.e., $f_{mul}$ and
$f_{pow}$ with $\lambda = 2$, will rank higher the items belonging to the closest users.

8. Other Related Work



The topic of search in a social setting has received increased attention lately. Studies and
models of personalization of social tagging sites can be found in [18, 13, 11, 21]. Other
studies have found that including social knowledge in scoring models can improve search 
and recommendation algorithms. In [8], personalization based on a similarity network is 
shown to outperform other personalization approaches and the non-personalized social 
search. A study on a last.fm dataset in [15] has found that incorporating social knowledge
in a graph model system improves the retrieval recall of music track recommendation 
algorithms. An architecture for social data management is given in [2, 3], along with a
framework for information discovery and presentation in social content sites. Another 
approach to rank resources in social tagging environments is CubeLSI [6], which uses 
a vector space model and extends Latent Semantic Indexing to include taggers in the 
feature space of resources, in order to better match queries to documents. FolkRank [14]
proposes a ranking model in social bookmarking sites, for recommendation and search, 
based on an adaptation of PageRank over the tripartite graph of users, tags and resources. 
It follows the intuition that a resource that is tagged with important tags by important 
users becomes important itself and, symmetrically, for tags and users. An alternative 
approach to social-aware search, using personalized PageRank, was presented in [3]. 
There, the same tripartite model of annotators, resources and annotations is used to 
compute measures of similarities between resources and queries, and to capture the 
social popularity of resources. However, none of these approaches incorporates the
user-to-user relationships in its ranking model. In contrast, the social network is an integral
part of the scoring model in our setting, if not the decisive one, while this network can 
have various semantics (e.g., tagging similarity, activity similarity or even trust). 

The scoring model used in [18] is revisited in [22]. There, a textual relevance and 
a social influence score are combined in the overall scoring of items, the latter being 
computed as the inverse of the shortest path between the seeker and the document 
publishers. This model is also used in the context of top-k retrieval of spatial web 
objects [7], where a prestige-based relevance score is computed by combining the overall 
relevance of an object with its spatial distance. 

9. Conclusions and Future Work 

We considered in this paper top-k query answering in social bookmarking applications, 
proposing algorithms that have the potential to scale in real applications, in an online 
context where the social network, the tagging data and even the seekers' search ingredi- 
ents can change at any moment. Our solutions address the main drawbacks of previous 
approaches. With respect to applicability and scalability, we avoid expensive and hardly 
updatable pre-computations of proximity values, by an on-the-fly approach. We show 
that it is applicable to a wide family of functions for proximity computation in a social 
network. With respect to efficiency, we show that TOPKS is instance optimal in the 
exclusively social context and, via extensive experiments, that it performs significantly
better than the algorithm from previous literature. We also considered widely-applicable
approximate techniques, showing they have the potential to drastically reduce computa- 
tion costs, while exhibiting high accuracy. 

We see many directions for future work. As mentioned in the previous section, op- 
timizing the branch choice heuristic is a promising direction that we plan to explore 
further. Experimenting with other aggregation functions, probabilistic bounds using
statistics tailored to certain assumptions (e.g., for power-law distributions) or richer
descriptions for proximity vectors and term-frequencies are other important directions. We
are also investigating approaches for computing results in a distributed style, when one 
has access to query results pertaining to various seekers, or when the same query is run 
at various points in the network. Finally, we intend to adapt our approach to deal with
networks that also contain negative links (e.g., trust / distrust networks).

A. Proof of Theorem 1

Proof. Since on each access to a user list, all items tagged by the respective user with
any of the query terms are retrieved, the position in the proximity vector at any step
in the run of the algorithm is not tag-dependent. So $\mathrm{cost}(A, \mathcal{D})$ is
equal to the position $p$ in the seeker's proximity vector at the moment of $A$'s
termination. Throughout the proof, we use the subscript $p$ to denote the value of a
given variable at step $p$ in the execution of $A$. We will use a proof argument similar
in style to the one for NRA [12].

Let us assume that TOPKS$_{\alpha=0}$ does not stop at position $p$ (in the proximity
vector) and that there exists an algorithm $A \neq$ TOPKS$_{\alpha=0}$ that does.

Since TOPKS$_{\alpha=0}$ does not stop at position $p$, there exists an item
$r \notin \{D_p[1], \dots, D_p[k]\}$ having
$\textsc{MaxScore}_p(r, Q) > \textsc{MinScore}_p(D_p[k], Q)$, and
$\textsc{MinScore}_p(r, Q) \le \textsc{MinScore}_p(D_p[i], Q), \forall i \in \{1, \dots, k\}$.
If $\textsc{MinScore}_p(r, Q) = \textsc{MinScore}_p(D_p[k], Q)$ then necessarily
$\textsc{MaxScore}_p(r, Q) \le \textsc{MaxScore}_p(D_p[k], Q)$ (ties for pessimistic
scores are broken by the optimistic ones, then arbitrarily for the optimistic scores).
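The stopping condition that this step of the argument relies on can be sketched in
Python as follows. This is only an illustration in the style of NRA: the dictionaries
`min_score` and `max_score`, the function name `can_stop`, and the omission of the bound
for unseen items are assumptions, not the paper's exact procedure.

```python
def can_stop(min_score, max_score, k):
    """Return True if no item outside the current top-k can still enter it.

    min_score, max_score: dicts mapping each candidate item to its
    pessimistic and optimistic score at the current step.
    """
    # Rank by pessimistic score, breaking ties with the optimistic score.
    ranked = sorted(min_score, key=lambda i: (min_score[i], max_score[i]), reverse=True)
    if len(ranked) < k:
        return False
    threshold = min_score[ranked[k - 1]]  # MinScore of the k-th candidate
    # Continue as long as some outsider r has MaxScore(r) > MinScore(D[k]).
    return all(max_score[r] <= threshold for r in ranked[k:])
```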

In $\mathcal{D}$, we assume that at step $p$ we have with TOPKS$_{\alpha=0}$ in the
current (unconsumed) position in each of the $|Q|$ inverted lists $IL(t_j)$ an item
$v_j$, necessarily not yet candidate. By definition, for any algorithm $A \in S$, for
any tag $t_j$ of the input $Q$, $A$ is at most as advanced in the inverted list
$IL(t_j)$ as TOPKS$_{\alpha=0}$. Without loss of generality, let us assume $A$ is as
advanced as TOPKS$_{\alpha=0}$.

Towards a contradiction, showing that $A$ is not sound over all possible inputs, we will
construct an instance $\mathcal{D}'$, which is equal to $\mathcal{D}$ up to position
$p$. We consider the following two possible cases:

Case 1: $A$ outputs $r$ as one of the top-k items, i.e., there do not exist $k$ items
having a higher score than $r$.

$\mathcal{D}'$ will start from what $A$ could have already read and used, including the
items $v_j = top\_item_p(t_j)$ and the value $top_p(H)$ (the proximity value of the
$(p+1)$-th user).

$\mathcal{D}'$ will be such that $\textsc{Score}(r, Q) = \textsc{MinScore}_p(r, Q)$, and
$\textsc{Score}(D_p[i], Q) = \textsc{MaxScore}_p(D_p[i], Q), \forall i \in \{1, \dots, k\}$.
Now, for each $D_p[i]$, if $\textsc{MaxScore}_p(D_p[i], Q) > \textsc{MinScore}_p(D_p[i], Q)$,
i.e., we do not have $D_p[i]$'s final score at step $p$, we assume the following in
$\mathcal{D}'$. For each $t_j \in Q$ for which $tf(D_p[i], t_j)$ is unknown, we assume
that we have $D_p[i]$ in $IL(t_j)$ after $v_j$, with $tf(D_p[i], t_j) = top\_tf_p(t_j)$.
Also, for every $t_j \in Q$ we set in the proximity vector, after $p + 1$, the next
$x_{ij} = unseen\_users(D_p[i], t_j)$ values to $top_p(H)$, making also $D_p[i]$ present
in each of these users' lists for $t_j$. By doing so, the exact score of each $D_p[i]$,
$i \in \{1, \dots, k\}$, is equal to the maximal possible one at step $p$; after
$\max_{i,j}(x_{ij})$ steps, all these $k$ scores $\textsc{Score}(D_p[i], Q)$ would be
computed.

For item $r$, for each $t_j \in Q$ for which we do not have $tf(r, t_j)$, since $r$ must
come later in $IL(t_j)$ (after $v_j$), we can assume that
$tf(r, t_j) = partial\_tf(r, t_j)$ (this makes $unseen\_users(r, t_j) = 0$). Also, for
every $t_j \in Q$ for which we do know $tf(r, t_j)$, after the required
$\max_{i,j}(x_{ij})$ proximity values set as described previously, we set the next
$unseen\_users(r, t_j)$ values in the proximity vector to 0, with each of these users
having tagged $r$ with $t_j$. All this ensures that
$\textsc{MinScore}_p(r, Q) = \textsc{Score}(r, Q)$.

We can now contradict the correctness of algorithm $A$, showing that
$\textsc{Score}(r, Q) < \textsc{Score}(D_p[i], Q)$ for all $i$.

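We have the following inequalities:

$$\textsc{MinScore}_p(D_p[k], Q) \ge \textsc{MinScore}_p(r, Q) \quad \text{(A.1)}$$
$$\textsc{MinScore}_p(D_p[k], Q) \le \textsc{MinScore}_p(D_p[i], Q), \forall i \quad \text{(A.2)}$$
$$\textsc{MinScore}_p(D_p[i], Q) \le \textsc{MaxScore}_p(D_p[i], Q), \forall i \quad \text{(A.3)}$$

If $\textsc{MinScore}_p(r, Q) < \textsc{MinScore}_p(D_p[k], Q)$ then it follows from
Eq. (A.1), (A.2), (A.3) that

$$\textsc{Score}(r, Q) < \textsc{Score}(D_p[i], Q), \forall i.$$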
If $\textsc{MinScore}_p(r, Q) = \textsc{MinScore}_p(D_p[k], Q)$ then, for each
$i \in \{1, \dots, k\}$, if:

1. $\textsc{MinScore}_p(D_p[k], Q) = \textsc{MinScore}_p(D_p[i], Q)$: we have
$\textsc{MaxScore}_p(r, Q) > \textsc{MinScore}_p(D_p[k], Q)$ and
$\textsc{MaxScore}_p(r, Q) \le \textsc{MaxScore}_p(D_p[i], Q)$; it follows that
$\textsc{Score}(r, Q) < \textsc{Score}(D_p[i], Q)$,

2. $\textsc{MinScore}_p(r, Q) < \textsc{MinScore}_p(D_p[k], Q)$: we have
$\textsc{MinScore}_p(D_p[k], Q) \le \textsc{Score}(D_p[i], Q)$; it follows that
$\textsc{Score}(r, Q) < \textsc{Score}(D_p[i], Q)$.

Hence, in any possible configuration, $r$ is not in the top-k result over
$\mathcal{D}'$. But since $\mathcal{D}$ and $\mathcal{D}'$ are indistinguishable by
algorithm $A$, which stops at step $p$ outputting $r$ in the result, this contradicts
$A$'s correctness.

Case 2: $A$ does not output $r$ as a top-k item, which means that $A$ assumes that the
final score of $r$, $\textsc{Score}(r, Q)$, is not in the top-k scores for $\mathcal{D}$.

$\mathcal{D}'$, indistinguishable from $\mathcal{D}$ up to position $p$, will now be
such that $\textsc{Score}(r, Q) = \textsc{MaxScore}_p(r, Q)$ and
$\textsc{Score}(D_p[i], Q) = \textsc{MinScore}_p(D_p[i], Q)$, for each $D_p[i] \in D_p$
s.t. $D_p[i] \neq r$.

If $r$'s score at step $p$ is not already the final one, i.e.,
$\textsc{MaxScore}_p(r, Q) \neq \textsc{MinScore}_p(r, Q)$, we assume the following in
$\mathcal{D}'$: for each tag $t_j \in Q$ for which $tf(r, t_j)$ is yet unknown, we
assume that $r$ comes later (after $v_j$) in $IL(t_j)$, having
$tf(r, t_j) = top\_tf_p(t_j)$. Then, for every $t_j \in Q$ we set in the proximity
vector, after the $p + 1$ position, the next $x_j = unseen\_users(r, t_j)$ values to
$top_p(H)$, making also $r$ present in each of these users' lists for $t_j$.

By this, the exact score of $r$ is equal to the maximal possible one at step $p$; after
$\max_j(x_j)$ steps, the score $\textsc{Score}(r, Q)$ would be computed.

Symmetrically, for each $D_p[i] \in D_p$ s.t. $D_p[i] \neq r$, and each $t_j \in Q$ for
which $tf(D_p[i], t_j)$ is yet unknown, we assume that $D_p[i]$ comes later (after
$v_j$) in $IL(t_j)$, having $tf(D_p[i], t_j) = partial\_tf(D_p[i], t_j)$ (hence
$unseen\_users(D_p[i], t_j) = 0$). Then, for every $t_j \in Q$ for which we know
$tf(D_p[i], t_j)$, after the $\max_j(x_j)$ values set as described previously in the
seeker's proximity vector, we set the next $y_{ij} = unseen\_users(D_p[i], t_j)$ values
to 0, making also $D_p[i]$ present in each of these users' lists for $t_j$. This
construction ensures that the exact score of each $D_p[i]$ is equal to the minimal
possible one at step $p$; after $\max_{i,j}(y_{ij})$ steps, all these scores
$\textsc{Score}(D_p[i], Q)$ would be computed.

Since we have that

$$\textsc{Score}(r, Q) = \textsc{MaxScore}_p(r, Q) > \textsc{MinScore}_p(D_p[k], Q)$$

and $\textsc{MinScore}_p(D_p[k], Q) = \textsc{Score}(D_p[k], Q)$, given that for every
item $D_p[l]$, $l > k$ s.t. $D_p[l] \neq r$ we have
$\textsc{Score}(D_p[l], Q) \le \textsc{MinScore}_p(D_p[k], Q)$, $r$ should be among the
top-k items in $\mathcal{D}'$. But since $\mathcal{D}'$ and $\mathcal{D}$ are
indistinguishable by algorithm $A$, which stops at step $p$ without outputting $r$ in
the result, this contradicts $A$'s correctness.

In this proof, we have ignored $\textsc{MaxScoreUnseen}(Q)$ in the inequalities. The
unseen items can be simulated by adding one virtual item $i_v$ to $D$, which does not
exist and will never be encountered in user lists, with
$\textsc{MinScore}(i_v, Q) = 0$ and
$\textsc{MaxScore}(i_v, Q) = \textsc{MaxScoreUnseen}(Q)$. Then, the same proof argument
applies to these items. □



Figure 10: Abstract cost and running time comparison over the tag-similarity network
and the $f_{pow}$, $\lambda = 1.1$ proximity function (red: top-10, yellow: top-20).
Panels: BM15 and tf-idf, each for $\alpha \in \{0, 0.1, 0.2, 0.3\}$; bars in each panel:
CMerge, TOPKS, MVar, Hist.

B. Other $\sigma^+$ functions

We present experimental results for $f_{pow}$ with $\lambda = 1.1$ in Figures 10, 11
and 12. While the results follow the same trend as those of $f_{mul}$, one can notice
that the speedups achieved by the TOPKS variants are directly affected by the speed of
the "drop" in proximity values. Generally, $f_{mul}$ values drop faster than those of
$f_{pow}$.

Figure 11: Abstract cost and running time comparison over the item-similarity network
and the $f_{pow}$, $\lambda = 1.1$ proximity function (red: top-10, yellow: top-20).

Figure 12: Abstract cost and running time comparison over the item-tag similarity
network and the $f_{pow}$, $\lambda = 1.1$ proximity function (red: top-10, yellow: top-20).